In [2]:
ENV["LINES"] = 10;
ENV["COLUMNS"] = 80;

HTML("""
<style>
.reveal pre code {
    max-height: none;
    font-size: 90%;
}
.rise-enabled .text_cell {
    font-size: 150%;
}
</style>
""")

# Julia + Jupyter + GPU = ⚗️🔬🧬🥰

Marius Millea (Project Scientist @ UC Davis in Cosmology)

NERSC GPU Science Day, Oct 12, 2023

Thanks to: Tim Besard + CUDA.jl/Julia contributors, Johannes Blaschke, Rollin Thomas

I work on analyzing maps of the Cosmic Microwave Background. Using tiny distortions imprinted by gravitational lensing, we can make maps of where all the dark matter is in the universe. We do so by solving **millions-of-dimensional Bayesian inference problems.** 

<video controls autoplay loop muted width="1800" height="600"><source src="kappa_forecast.mp4" type="video/mp4"></video>

Our basic code building blocks are array broadcasts and FFTs, which are perfectly suited to GPUs. Our group has been using GPUs since the Cori GPU testbed days.

This talk is not about the science, though. It is about **sharing the workflow we've developed over the last ~5 years.**

## Outline
* Julia + Jupyter + GPU motivation
* Julia CUDA Installation
* Basic and advanced Julia CUDA usage
* Multi-GPU workflows for embarrassingly parallel problems

## Motivation

* Julia
    * interactive but fast
    * powerful and flexible
    * less boilerplate: code looks like science

* Jupyter
    * convenient for interactive work
    * fast iterative development workflow

* GPU
    * duh

## Install

Installing Julia with CUDA is drop-dead simple. Julia's CUDA.jl package downloads compatible CUDA binaries itself:

```shell
$ curl -fsSL https://install.julialang.org | sh
$ julia
pkg> add CUDA # ~2min
   Resolving package versions...
   Installed CUDA_Driver_jll ── v0.6.0+3
   Installed LLVMExtra_jll ──── v0.0.26+0
   ...
   Installed CUDA ───────────── v5.0.0
 Downloading artifact: CUDA_Driver
```

(It's easy to select the CUDA version _per project_, e.g. with `CUDA.set_runtime_version!(v"11.4")`.)

I recommend this fully native Julia install over using any `modules`, i.e. I don't even have the `gpu` module loaded:

In [None]:
; module list

This has proven robust across many clusters I've tried.

Checking everything is installed:

In [None]:
using CUDA

In [None]:
CUDA.versioninfo()

## Basic usage

In [None]:
arr = rand(10_000_000)

In [None]:
carr = cu(arr)

In [None]:
sin.(carr) .+ 1

Let's benchmark:

In [None]:
using BenchmarkTools

In [None]:
@btime CUDA.@sync sin.(carr) .+ 1;

In [None]:
@btime sin.(arr) .+ 1;

In [None]:
CUDA.@profile sin.(carr) .+ 1;

## Power of Julia (1)

In Julia, you can easily put many arbitrary objects on GPU:

In [None]:
struct Point{T}
    x :: T
    y :: T
end

In [None]:
arr = Point.(rand(100), rand(100))
carr = cu(arr)

In e.g. JAX/PyTorch/TF, the only things you can put inside CUDA arrays are integers, floats, and complex numbers. In Julia, anything with a static memory layout is fine.
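To a first approximation, "static memory layout" corresponds to what Julia calls an `isbits` type: an immutable type that holds no references to heap objects. A minimal sketch of how to check this (the `Point` definition is repeated so the block runs on its own, and `NamedPoint` is a hypothetical counterexample that is not part of the talk):

```julia
# Same definition as in the cell above, repeated so this block is self-contained.
struct Point{T}
    x :: T
    y :: T
end

# Hypothetical counterexample: a String field is a reference to heap memory.
struct NamedPoint
    name :: String
    x :: Float64
end

isbitstype(Point{Float32})           # plain bits: a candidate CuArray element type
isbitstype(NamedPoint)               # holds a reference: not plain bits
isbitstype(Point{Vector{Float32}})   # parametrized with an Array: not plain bits
```

The check is per concrete type, so `Point` is only GPU-friendly when its parameter `T` is.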

In [None]:
distance_from_origin(p::Point) = sqrt(p.x^2 + p.y^2)

In [None]:
distance_from_origin.(carr)

## Limitations

In [None]:
function distance_from_origin_bad(p::Point)
    sqrt(sum([p.x^2, p.y^2]))
end

In [None]:
distance_from_origin_bad.(carr)

Limitations on code in functions that will be compiled for GPU:
* No calls to CPU functions
   * E.g. creating Arrays (use StaticArrays.jl instead)
* No _dynamic dispatch_
   * Code should be _type stable_
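
One way to repair `distance_from_origin_bad` is to swap the heap-allocated `Vector` for a tuple, which, like a StaticArrays.jl `SVector`, has a size known at compile time. This is only a sketch: `distance_from_origin_ok` is a name made up here, the block checks CPU behaviour only, and `Test.@inferred` stands in for the type-stability requirement.

```julia
using Test  # standard library; provides @inferred

# Same definition as earlier, repeated so this block is self-contained.
struct Point{T}
    x :: T
    y :: T
end

# Tuple instead of `[p.x^2, p.y^2]`: no Array is created inside the function.
distance_from_origin_ok(p::Point) = sqrt(sum((p.x^2, p.y^2)))

p = Point(3f0, 4f0)

# @inferred throws if the return type cannot be inferred from the argument
# types (i.e. the call is not type stable); otherwise it returns the value.
d = @inferred distance_from_origin_ok(p)
```

With `carr` from the cells above, `distance_from_origin_ok.(carr)` would then be the call to try on the GPU.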

## Power of Julia (2)

You can also directly write kernels in Julia, giving the full power and flexibility of CUDA kernel programming:

In [None]:
function my_kernel(carr_out, carr)   
    start = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    len = length(carr)
    for i = start:stride:len  # "grid-stride" loop
        carr_out[i] = sin(carr[i]) + 1
    end
    return
end

In [None]:
carr = cu(rand(10_000_000))
carr_out = similar(carr);

In [None]:
@cuda threads=256 my_kernel(carr_out, carr)

In [None]:
carr_out

See [Kernel Programming](https://cuda.juliagpu.org/stable/api/kernel/) for the full list of CUDA.jl kernel programming capabilities.
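
Note that the launch above passes only `threads=256`, so it runs a single block (`blocks` defaults to 1) and each thread walks many elements. Adding e.g. `blocks=cld(length(carr), 256)` to the `@cuda` call would spread the work over more of the device. The grid-stride loop stays correct either way, which can be checked on the CPU by replaying the index arithmetic. A sketch, where `grid_stride_visits` is a hypothetical helper written for this illustration and not part of CUDA.jl:

```julia
# Count how often each index 1:len is visited when `blocks` blocks of `threads`
# threads each run the grid-stride loop from `my_kernel` (1-based, as in CUDA.jl).
function grid_stride_visits(len, threads, blocks)
    visits = zeros(Int, len)
    for block_idx in 1:blocks, thread_idx in 1:threads
        start  = (block_idx - 1) * threads + thread_idx
        stride = threads * blocks
        for i in start:stride:len
            visits[i] += 1
        end
    end
    return visits
end

# A single block and a full grid both touch every element exactly once.
all(==(1), grid_stride_visits(10_000, 256, 1))
all(==(1), grid_stride_visits(10_000, 256, cld(10_000, 256)))
# A grid that is too small or too large for the array is still correct.
all(==(1), grid_stride_visits(10_000, 256, 3))
all(==(1), grid_stride_visits(100, 256, 4))
```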

## Multi-GPU (single node)

In [None]:
CUDA.devices()

In [None]:
CUDA.device()

In [None]:
CUDA.device!(1)

In [None]:
arr = rand(10_000_000)
carr = cu(arr)
@btime CUDA.@sync sin.(carr) .+ 1;

CUDA.jl does its own memory management, so before switching back to GPU 0 I hand the memory back. You usually don't have to think about this unless you use the same GPU from multiple processes, which I do for the purposes of this demo:

In [None]:
GC.gc()
CUDA.reclaim()

In [None]:
CUDA.device!(0)

You can use multiple GPUs via Julia processes, tasks, or threads. 

The most robust and easy way I have found (as of 2023), which I recommend starting with, is per-_process_:

In [None]:
using Distributed

In [None]:
addprocs(3)

In [None]:
@everywhere using CUDA, BenchmarkTools

In [None]:
@everywhere procs() println((myid(), CUDA.device()))

In [None]:
@everywhere procs() CUDA.device!(myid()-1)

In [None]:
@everywhere procs() println((myid(), CUDA.device()))

Let's run our benchmark in parallel across all GPUs:

In [None]:
let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:4) do i
        @btime CUDA.@sync sin.($carr) .+ 1
    end
end

Note that `carr` was defined and moved to the GPU on the master process. Julia automatically sent it to the worker GPUs, then automatically sent the results back to the master GPU.

In doing so, the array passed through CPU memory, so it's not the most efficient approach (but it's the easiest).

To go straight GPU-to-GPU, you can use _unified memory_ on a single node, or CUDA MPI transport (later in this talk).
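
The automatic transfer described above is ordinary Distributed.jl closure serialization and is not specific to GPUs. A CPU-only sketch of the same mechanism, using two throwaway local workers (the variable names are made up for this illustration):

```julia
using Distributed

new_workers = addprocs(2)  # two local worker processes; no GPU needed here

arr_sum, results = let arr = rand(1_000)   # `arr` lives on the master process
    # `arr` is captured by the closure, so pmap serializes a copy to each worker.
    worker_results = pmap(WorkerPool(new_workers), 1:4) do i
        (myid(), sum(arr))
    end
    sum(arr), worker_results
end

rmprocs(new_workers)  # clean up the throwaway workers
```

Each entry of `results` holds the id of the process that ran it and the sum it computed from its own copy of `arr`. For a `CuArray` the same serialization step is what routes the data through CPU memory.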

## Multi-GPU (multiple nodes, elastic)

In [None]:
using ClusterManagers, Sockets  # Sockets provides IPv4

In [None]:
em = ElasticManager(
    # Perlmutter specific ↓
    addr = IPv4(first(filter(!isnothing, match.(r"inet (.*)/.*hsn0", readlines(`ip a show`)))).captures[1]),
    port = 0
);
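
The `addr` one-liner scans the output of `ip a show` for the IPv4 address bound to the `hsn0` interface, so that workers on other nodes can reach the manager. Unpacked into a function and run on made-up sample text (the addresses below are invented for illustration, and `hsn0_address` is a hypothetical helper name):

```julia
using Sockets  # provides IPv4 and the ip"..." string macro

# Return the IPv4 address from the first line matching "inet <addr>/... hsn0".
function hsn0_address(lines)
    matches = filter(!isnothing, match.(r"inet (.*)/.*hsn0", lines))
    isempty(matches) && error("no hsn0 interface found")
    return parse(IPv4, first(matches).captures[1])
end

# Invented sample in the style of `ip a show` output.
sample = [
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536",
    "    inet 127.0.0.1/8 scope host lo",
    "2: hsn0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000",
    "    inet 10.0.0.42/16 brd 10.0.255.255 scope global hsn0",
]

addr = hsn0_address(sample)
```

On a cluster whose interconnect interface has a different name, only the `hsn0` part of the pattern should need changing.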

In [None]:
em

Now submit a job, e.g. with:
```bash
salloc -C gpu -q regular -t 00:30:00 --cpus-per-task 32  --gpus-per-task 1 --ntasks-per-node 4 --nodes 8 -A mp107
```
then run the "worker connect command" printed above (could also do all-in-one as a batch job).

With more GPUs across different nodes, it's more complex to assign one unique GPU to each process. Instead we can use this utility function:

In [None]:
using CUDADistributedTools

In [None]:
CUDADistributedTools.assign_GPU_workers()
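
The underlying problem is that worker ids are global while GPU indices are local to each node, so `myid()-1` no longer works. A sketch of the bookkeeping such a utility has to do, using only host names (an illustration with the hypothetical helper `local_gpu_index`, not the actual implementation of `assign_GPU_workers`):

```julia
# Map each worker id to a 0-based GPU index that is unique within its host.
# `hosts` maps worker id => hostname; with Distributed loaded it could be built as
# `Dict(id => remotecall_fetch(gethostname, id) for id in workers())`.
function local_gpu_index(hosts::Dict{Int,String})
    next_free = Dict{String,Int}()   # next unused GPU index per host
    assignment = Dict{Int,Int}()
    for id in sort(collect(keys(hosts)))
        host = hosts[id]
        idx = get(next_free, host, 0)
        assignment[id] = idx
        next_free[host] = idx + 1
    end
    return assignment
end

# Invented layout: two nodes with two workers each.
hosts = Dict(2 => "nid0001", 3 => "nid0001", 4 => "nid0002", 5 => "nid0002")
assignment = local_gpu_index(hosts)
```

Each worker would then select its device with `CUDA.device!(assignment[myid()])`.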

Let's run parallel benchmarks again:

In [None]:
@everywhere using CUDA, BenchmarkTools

In [None]:
let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
        return nothing
    end
end;

## Multi-GPU (multiple nodes, MPI)

Installing MPI for Julia and configuring:
```julia
pkg> add MPI MPIPreferences

julia> MPIPreferences.use_system_binary(;vendor="cray", mpiexec="srun") # <- options are Perlmutter specific

┌ Info: MPI implementation identified
│   libmpi = "libmpi_gnu_91.so"
│   version_string = "MPI VERSION    : CRAY MPICH version 8.1.25.17 (ANL base 3.4a2)\nMPI BUILD INFO : Sun Feb 26 15:15 2023 (git hash aecd99f)\n"
│   impl = "CrayMPICH"
│   version = v"8.1.25"
└   abi = "MPICH"
┌ Info: MPIPreferences changed
│   binary = "system"
│   libmpi = "libmpi_gnu_91.so"
│   abi = "MPICH"
│   mpiexec = "srun"
│   preloads =
│    1-element Vector{String}:
│     "libmpi_gtl_cuda.so"
└   preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
```

(This works thanks to, among others, NERSC's Johannes Blaschke and his contributions to MPI.jl.)

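You can put the SLURM script and the Julia script in one file. Bash sees the `#SBATCH` lines, runs `srun`, and exits, while Julia treats everything between `#=` and `=#` as a block comment and runs the rest. For example,
`test_script.jl`: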

```julia
#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00 
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#

using MPIClusterManagers, Distributed, CUDA, BenchmarkTools
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)

let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
    end
end

MPIClusterManagers.stop_main_loop(mgr)

```

Then `sbatch test_script.jl`.

Here, movement of memory between GPUs will happen via CUDA MPI transport 🚀

## Multi-GPU (multiple nodes, MPI, notebooks)

### Some code in a notebook:

In [None]:
let
    carr = cu(rand(10_000_000))
    pmap(WorkerPool(procs()), 1:nprocs()) do i
        @btime CUDA.@sync sin.($carr) .+ 1
        return nothing
    end
end;

### Now use:

In [None]:
using ParameterizedNotebooks

In [None]:
nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))

In [None]:
nb()

You can put the call to the notebook code directly in a `test_script_2.jl`:
```julia
#!/bin/bash
#SBATCH -C gpu -q regular -A mp107
#SBATCH -t 00:05:00 
#SBATCH --cpus-per-task 32 --gpus-per-task 1 --ntasks-per-node 4 --nodes 4
#=
srun /global/u1/m/marius/.julia/juliaup/julia-1.9.3+0.x64.linux.gnu/bin/julia $(scontrol show job $SLURM_JOBID | awk -F= '/Command=/{print $2}')
exit 0
# =#

using MPIClusterManagers, Distributed, CUDA, ParameterizedNotebooks  # ParameterizedNotebooks provides ParameterizedNotebook
mgr = MPIClusterManagers.start_main_loop(MPIClusterManagers.MPI_TRANSPORT_ALL)

nb = ParameterizedNotebook("talk.ipynb", sections=("Some code in a notebook:",))
nb()

MPIClusterManagers.stop_main_loop(mgr)
```

With some care in the organization of your sections, you can iterate on code in the notebook, even test it in parallel using on-the-fly `ElasticManager` workers, then submit the identical code as an MPI job for larger-scale runs 🎉

## Conclusions

* Julia + Jupyter + GPUs offer powerful scientific workflows
* Hopefully the approaches shared here, which we've learned over the years, help you do this efficiently
* Wishlist
    * More robust and easier CUDA.jl task/threading support
    * An easy way to use MPI CUDA transport protocol from within Jupyter jobs
    * A _multi-node_ GPU monitor, even just a command-line one
        * `nvitop`, `btop` (PR), and `gpustat` are some good command line single-node options