# Introduction

Goal of this course: learn how to program GPUs with Julia

Expected experience: minimal familiarity with Julia and GPU programming

Day 1: programming the GPU
- Introduction
- Programming models
- Profiling and optimization

Day 2: advanced topics
- Memory management
- Concurrent computing
- Low-level CUDA APIs

## Why GPU programming in Julia?

*Very briefly*: Julia is

- a **high level** language: easy to write and read
- **designed for performance**: by using careful abstractions, and a JIT compiler

This uncommon combination makes it a great language for GPU programming.

## Why CUDA.jl?

Essentially, because it's the most mature and best optimized GPU back-end for Julia.

However, other back-ends are built on top of the same stack, and share a lot of the same infrastructure:

- GPUArrays.jl: vendor neutral array operations
- GPUCompiler.jl: shared infrastructure for kernel compilation
- LLVM.jl, Adapt.jl, etc

As a result, it should be possible to apply much of what you learn here to other back-ends as well. In order of maturity:

- AMDGPU.jl: for [ROCm-supported](https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html) GPUs on Linux
- oneAPI.jl: for most [Intel IGPs and dedicated GPUs](https://github.com/intel/compute-runtime#supported-platforms) on Linux
- Metal.jl: for Apple M-series GPUs on macOS

There are also some other experimental back-ends; for more details, see [juliagpu.org](https://juliagpu.org/).

## Set-up

CUDA.jl is easy to install: just run `Pkg.add("CUDA")`. For this course, I've provided a pre-configured environment with all relevant packages:

In [1]:
using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()


  Activating project at `~/course`
└ @ ~/course/Manifest.toml:0
Precompiling project...
  ✓ SIMDTypes
  ✓ UnPack
  ✓ RealDot
  ✓ CustomUnitRanges
  ✓ RangeArrays
  ✓ ManualMemory
  ✓ IndirectArrays
  ✓ TensorCore
  ✓ StatsAPI
  ✓ IntervalSets
  ✓ PkgVersion
  ✓ IterTools
  ✓ DocStringExtensions
  ✓ Ratios
  ✓ IfElse
  ✓ ProgressMeter
  ✓ NaNMath
  ✓ LazyModules
  ✓ Inflate
  ✓ IrrationalConstants
  ✓ ComputationalResources
  ✓ CpuId
  ✓ MappedArrays
  ...

Afterwards, try `CUDA.versioninfo()` to see some details about the environment:

In [2]:
using CUDA
CUDA.versioninfo()


[ Info: Precompiling CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]


CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 470.57.2, originally for CUDA 11.4

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 11.0.0+470.57.2

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.10.0-rc1
- LLVM: 15.0.7

Environment:
- JULIA_CUDA_MEMORY_POOL: none

1 device:
  0: Tesla P100-PCIE-16GB (sm_60, 15.897 GiB / 15.899 GiB available)
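
The device line at the end of this output can also be queried from code, which is handy for checking that a GPU is usable before running the rest of a notebook. The following is a minimal sketch, assuming a functional CUDA.jl installation. It relies on `CUDA.functional`, `CUDA.devices`, `CUDA.name`, `CUDA.capability`, `CUDA.totalmem` and `CUDA.attribute`, none of which are used elsewhere in this notebook:

```julia
using CUDA

if CUDA.functional()
    for dev in CUDA.devices()
        println(CUDA.name(dev))
        println("  compute capability: ", CUDA.capability(dev))
        println("  total memory:       ",
                round(CUDA.totalmem(dev) / 2^30; digits=2), " GiB")
        # Per-block thread limit; relevant when choosing a launch configuration
        println("  max threads/block:  ",
                CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))
    end
else
    @warn "No functional CUDA set-up found; the GPU cells below will fail."
end
```

The values printed depend on the machine, so no output is shown here.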


CUDA.jl bundles many CUDA libraries (from the core toolkit as well as external libraries), making them available as artifacts packaged using BinaryBuilder.jl. Selection of these libraries is tricky, as several compatibility requirements need to be taken into account:

- the local NVIDIA driver has a CUDA Toolkit compatibility level
- starting with CUDA 11.0, the driver is forwards-compatible ("Enhanced Compatibility")
- it may be possible to load a more recent driver library to raise the compatibility level ("Forward Compatibility")
- external libraries (CUDNN, CUTENSOR) may have builds for each CUDA Toolkit

For this course, CUDA.jl should be using CUDA 12.3 from artifacts (instead of the old CUDA 11.0 toolkit provided by the module system).

If you need to disable the use of artifacts, call `CUDA.set_runtime_version!(version; local_toolkit=true)`, where `version` is the version of the local CUDA toolkit. In this configuration, you are responsible for ensuring compatibility between the driver, toolkit, and any libraries! `CUDA.versioninfo()` will then look as follows:

```julia-repl
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.74.0, for CUDA 11.4
CUDA driver 11.4

...
```
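
To check which versions are in use without reading the full `versioninfo()` report, the driver and runtime versions can be queried directly. This is a small sketch, assuming a functional GPU set-up. The version `v"11.4"` in the commented-out call is only an example taken from the output above and should be replaced by whatever toolkit is installed locally:

```julia
using CUDA

if CUDA.functional()
    # Versions CUDA.jl is actually using in this session
    @show CUDA.driver_version()    # CUDA compatibility level of the driver
    @show CUDA.runtime_version()   # CUDA toolkit in use (artifact or local)
end

# Switching to a local toolkit (example version only). The choice is stored
# as a preference of the active project and only takes effect after
# restarting Julia, so it is left commented out here:
# CUDA.set_runtime_version!(v"11.4"; local_toolkit=true)
```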

If CUDA.jl doesn't manage to discover your local CUDA installation, you can try launching Julia with the environment variable `JULIA_DEBUG` set to `CUDA`. This will reveal the locations CUDA.jl searches in for toolkit libraries and binaries:

```
$ JULIA_DEBUG=CUDA \
  julia -e 'using CUDA; device()'
┌ Debug: Trying to use local installation...
└ @ CUDA.Deps ~/.julia/packages/CUDA/YpW0k/deps/bindeps.jl:164
┌ Debug: Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA.Deps ~/.julia/packages/CUDA/YpW0k/deps/discovery.jl:271
┌ Debug: Looking for binary nvdisasm in /opt/cuda
│   all_locations =
│    2-element Vector{String}:
│     "/opt/cuda"
│     "/opt/cuda/bin"
└ @ CUDA.Deps ~/.julia/packages/CUDA/YpW0k/deps/discovery.jl:147
┌ Debug: Found nvdisasm at /opt/cuda/bin/nvdisasm
└ @ CUDA.Deps ~/.julia/packages/CUDA/YpW0k/deps/discovery.jl:153
┌ Debug: Looking for library cudart, no specific version, in /opt/cuda
...
└ @ CUDA.Deps ~/.julia/packages/CUDA/YpW0k/deps/compatibility.jl:210
```

## A quick tour

CUDA.jl is a large package, containing a lot of functionality:

- wrappers for the CUDA driver, to manage devices, streams, etc
- wrappers for CUDA's libraries, such as cuBLAS, cuFFT, cuRAND, etc
- native kernel programming support: `@cuda`, intrinsics wrappers, a compiler, etc
- high-level functionality that builds on all of the above: `CuArray`, array operations, stdlib integrations, etc

As an introduction, I'll use a simple parallel operation that can be used to demonstrate all of the above: AXPY, or $z = \alpha x + y$.

A simple CPU implementation of AXPY using array operations would look as follows:

In [3]:
axpy!(z, a, x, y) = z .= a .* x .+ y

x = [1, 2]
y = [2, 3]
alpha = 4
z = similar(x)

axpy!(z, alpha, x, y)


2-element Vector{Int64}:
  6
 11

At the highest level of abstraction, we can use exactly the same array operations with CUDA.jl to run this on the GPU:

In [4]:
x = CuArray([1, 2])
y = CuArray([2, 3])
alpha = 4
z = similar(x)

axpy!(z, alpha, x, y)


2-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
  6
 11

This is one of the strengths of Julia's GPU ecosystem: generic array operations make it possible to write code that works on all kinds of inputs. Just changing the input type is sufficient to "port" an application to the GPU. Of course, this doesn't always work out perfectly, and sometimes you still need custom operations, but it's a great starting point.
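
To make the "just change the input type" point concrete, here is a small sketch in which the array constructor is the only thing that differs between the CPU and GPU runs. The helper `run_axpy` is hypothetical and exists only for this illustration; the GPU call is commented out so that the block also runs on a machine without a GPU:

```julia
# Same broadcast-based AXPY as above; no type annotations, so it is generic.
axpy!(z, a, x, y) = z .= a .* x .+ y

# Hypothetical helper: build the inputs with whichever array type is passed in.
function run_axpy(ArrayType)
    x = ArrayType(Float32[1, 2])
    y = ArrayType(Float32[2, 3])
    z = similar(x)
    axpy!(z, 4f0, x, y)
    return Array(z)        # copy back to a plain CPU array for inspection
end

run_axpy(Array)            # runs on the CPU
# run_axpy(CuArray)        # runs on the GPU, after `using CUDA`
```

Both calls go through the same `axpy!` method; Julia specializes the broadcast for the array type it receives.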

Let's peel back one of the layers, and implement a custom kernel instead of re-using the `broadcast` implementation from CUDA.jl:

In [5]:
function axpy_kernel!(z, a, x, y)
    function kernel(z, a, x, y)
        i = threadIdx().x
        if i ≤ length(z)
            @inbounds z[i] = a * x[i] + y[i]
        end
        return
    end

    @cuda threads=length(z) kernel(z, a, x, y)

    return z
end

x = CuArray([1, 2])
y = CuArray([2, 3])
alpha = 4
z = similar(x)
axpy_kernel!(z, alpha, x, y)


2-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
  6
 11

The kernel above is too simple though, as it ignores a crucial limitation: you cannot launch an unbounded number of threads in a single block. Instead, you need to respect the device limit and, if needed, launch multiple blocks. Let's use at most 256 threads per block, which should work on any GPU:

In [6]:
function axpy_kernel!(z, a, x, y)
    function kernel(z, a, x, y)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i ≤ length(z)
            @inbounds z[i] = a * x[i] + y[i]
        end
        return
    end

    kernel = @cuda launch=false kernel(z, a, x, y)
    threads = min(length(z), 256)
    blocks = cld(length(z), threads)
    kernel(z, a, x, y; threads, blocks)

    return z
end

x = CuArray([1, 2])
y = CuArray([2, 3])
alpha = 4
z = similar(x)
axpy_kernel!(z, alpha, x, y)


2-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
  6
 11
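
The index arithmetic in this kernel can be checked on the CPU. The sketch below is not GPU code: the hypothetical helper `launch_indices` simply enumerates the `(block, thread)` pairs that such a launch would create and applies the same formula and bounds check as the kernel.

```julia
# CPU-only sketch: enumerate the global indices a 1D launch would produce.
function launch_indices(n; max_threads=256)
    threads = min(n, max_threads)
    blocks = cld(n, threads)          # round up so that every element is covered
    covered = Int[]
    skipped = 0
    for block in 1:blocks, thread in 1:threads
        i = (block - 1) * threads + thread   # same formula as in the kernel
        if i ≤ n
            push!(covered, i)
        else
            skipped += 1              # surplus threads in the last block do nothing
        end
    end
    return (; threads, blocks, covered, skipped)
end

cfg = launch_indices(1000)
# 4 blocks of 256 threads: 1024 threads for 1000 elements, so 24 are idle
```

Because `cld` rounds up, the launch can contain more threads than elements, which is why the `i ≤ length(z)` check in the kernel is required. For the 4096×4096 arrays used in the profiling cells, the same arithmetic gives `cld(4096 * 4096, 256) == 65536` blocks, matching the `Blocks` column of the trace. Note that an empty input would make `threads` zero, and `cld(0, 0)` throws a `DivideError`.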

In fact, there's already a good reason to use this `axpy_kernel!` version, as it generates slightly better code (requiring fewer registers) than the fully generic `broadcast` implementation:

In [7]:
x = CUDA.rand(Float32, 4096, 4096)
y = CUDA.rand(Float32, 4096, 4096)
z = similar(x)
alpha = rand(Float32)
axpy_kernel!(z, alpha, x, y)   # warm-up
CUDA.@profile trace=true axpy_kernel!(z, alpha, x, y)


Profiler ran for 460.15 µs, capturing 4 events.

Host-side activity: calling CUDA APIs took 46.97 µs (10.21% of the trace)
┌────┬──────────┬──────────┬────────────────┐
│ ID │    Start │     Time │ Name           │
├────┼──────────┼──────────┼────────────────┤
│  2 │ 51.26 µs │ 46.73 µs │ cuLaunchKernel │
└────┴──────────┴──────────┴────────────────┘

Device-side activity: GPU was busy for 366.45 µs (79.64% of the trace)
┌────┬──────────┬───────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────────────────────────┐
│ ID │    Start │      Time │ Threads │ Blocks │ Regs │ Name                                                                        │
├────┼──────────┼───────────┼─────────┼────────┼──────┼─────────────────────────────────────────────────────────────────────────────┤
│  2 │ 91.79 µs │ 366.45 µs │     256 │  65536 │    8 │

In [8]:
axpy!(z, alpha, x, y)
CUDA.@profile trace=true axpy!(z, alpha, x, y)


Profiler ran for 650.17 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 33.86 µs (5.21% of the trace)
┌────┬─────────┬──────────┬────────────────┐
│ ID │   Start │     Time │ Name           │
├────┼─────────┼──────────┼────────────────┤
│  4 │ 65.8 µs │ 31.23 µs │ cuLaunchKernel │
└────┴─────────┴──────────┴────────────────┘

Device-side activity: GPU was busy for 554.8 µs (85.33% of the trace)
┌────┬──────────┬──────────┬─────────┬────────┬──────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ID │    Start │     Time │ Threads │ Blocks │ Regs │ Name

Another way to compute AXPY is to use the cuBLAS library, which provides the `cublasXaxpy` functions. We've got those conveniently wrapped in CUDA.jl as `CUBLAS.axpy!`:

In [9]:
x = CuArray([1, 2])
y = CuArray([2, 3])
alpha = 4
CUBLAS.axpy!(length(y), alpha, x, y)


LoadError: MethodError: no method matching axpy!(::Int64, ::Int64, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})

Closest candidates are:
  axpy!(::Integer, ::Number, ::StridedCuArray{ComplexF16}, ::StridedCuArray{ComplexF16})
   @ CUDA ~/CUDA/lib/cublas/wrappers.jl:226
  axpy!(::Integer, ::Number, ::StridedCuArray{Float16}, ::StridedCuArray{Float16})
   @ CUDA ~/CUDA/lib/cublas/wrappers.jl:221
  axpy!(::Integer, ::Number, ::StridedCuArray{ComplexF32}, ::StridedCuArray{ComplexF32})
   @ CUDA ~/CUDA/lib/cublas/wrappers.jl:211
  ...


Well, that immediately demonstrates one of the limitations of NVIDIA's vendor libraries: they only support a handful of types. This is one of the reasons that a native compiler is so valuable.

To test CUBLAS, let's use Float32 inputs:

In [10]:
x = CuArray([1f0, 2f0])
y = CuArray([2f0, 3f0])
alpha = 4f0
CUBLAS.axpy!(length(y), alpha, x, y)


2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
  6.0
 11.0
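
One detail to keep in mind when comparing with the earlier versions: BLAS-style AXPY works in place, overwriting `y` with $\alpha x + y$ rather than writing to a separate `z`, which is why the call above has no output argument. The CPU-only sketch below uses `LinearAlgebra.axpy!` from the standard library, which follows the same BLAS convention, as a stand-in to show the effect:

```julia
using LinearAlgebra

x = Float32[1, 2]
y = Float32[2, 3]
y_before = copy(y)

# BLAS convention: y ← α x + y, overwriting y in place
LinearAlgebra.axpy!(4f0, x, y)

@show y_before   # the original contents, kept only because we copied them
@show y          # now holds the result
```

If the original `y` is still needed afterwards, copy it first, as done here with `y_before`.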

In [11]:
x = CUDA.rand(Float32, 4096, 4096)
y = CUDA.rand(Float32, 4096, 4096)
alpha = rand(Float32)
CUBLAS.axpy!(length(y), alpha, x, y)   # warm-up
CUDA.@profile trace=true CUBLAS.axpy!(length(y), alpha, x, y)


Profiler ran for 467.3 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 30.04 µs (6.43% of the trace)
┌────┬──────────┬───────────┬──────────────────┐
│ ID │    Start │      Time │ Name             │
├────┼──────────┼───────────┼──────────────────┤
│  2 │ 14.78 µs │ 715.26 ns │ cudaGetLastError │
│  3 │ 18.12 µs │  28.61 µs │ cudaLaunchKernel │
│  4 │ 46.97 µs │    0.0 ns │ cudaGetLastError │
└────┴──────────┴───────────┴──────────────────┘

Device-side activity: GPU was busy for 420.57 µs (90.00% of the trace)
┌────┬──────────┬───────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────┐
│ ID │    Start │      Time │ Threads │ Blocks │ Regs │ Name                                                    │
├────┼──────────┼───────────┼─────────┼────────┼──────┼─────────────────────────────────────────────────────────┤
│  3 │ 44.58 µs │

Note that, in the traces shown above, CUBLAS (420.57 µs of GPU time) is faster than the `broadcast` implementation (554.8 µs) despite using the most registers, but it does not beat our naive `axpy_kernel!` (366.45 µs). The differences aren't huge either way, so it's always a good thing to benchmark your code!
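
As a sketch of what such a benchmark could look like without the profiler, the block below times the broadcast version with `CUDA.@elapsed` and converts the result into an effective memory bandwidth. The helper `axpy_bandwidth` is hypothetical and only encodes the assumption that AXPY reads two arrays and writes one. The GPU part is guarded so that the arithmetic can be tried on any machine:

```julia
using CUDA

# Effective bandwidth in GB/s for an AXPY over `n` elements of type `T`:
# two arrays are read and one is written, so 3 * n * sizeof(T) bytes are moved.
axpy_bandwidth(n, T, seconds) = 3 * n * sizeof(T) / seconds / 1e9

# Using the GPU time of `axpy_kernel!` from the trace above (366.45 µs):
axpy_bandwidth(4096 * 4096, Float32, 366.45e-6)   # ≈ 549 GB/s

# Broadcast version from earlier in this notebook
axpy!(z, a, x, y) = z .= a .* x .+ y

if CUDA.functional()
    x = CUDA.rand(Float32, 4096, 4096)
    y = CUDA.rand(Float32, 4096, 4096)
    z = similar(x)
    alpha = rand(Float32)

    axpy!(z, alpha, x, y)                          # warm-up: triggers compilation
    times = [CUDA.@elapsed(axpy!(z, alpha, x, y)) for _ in 1:10]
    best = minimum(times)                          # seconds
    println("broadcast: ", round(best * 1e6; digits=1), " µs, ",
            round(axpy_bandwidth(length(x), Float32, best); digits=1), " GB/s")
end
```

Taking the minimum over several runs reduces the influence of one-off delays. Comparing the resulting bandwidth against the device's peak memory bandwidth gives a rough idea of how much room for improvement is left for a memory-bound operation like AXPY.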

Now, before moving on to the next notebook, be sure to stop or restart the kernel, as Piz Daint only allows a single process to use the GPU.