In this repo, we explore high-performance computing (HPC) parallelism in Julia. After learning the basics, we apply what we learned to a real-world application: GPU-accelerating QuantumAnnealingTools.jl (now OpenQuantumTools.jl), a package for simulating open quantum systems. We start with a few user guides explaining how to set up Julia on Discovery, the HPC cluster at USC. In particular, we show how to set up and use MPI and CUDA, how to install OpenQuantumTools, and how to edit OpenQuantumTools in "develop mode."
For this project, we did the following:
- Wrote a simple code to calculate pi in Julia, our "hello world" of HPC, just as we did in the CSCI 596 course (see here)
- Benchmarked the MPI implementation of the pi calculation, showing its strong- and weak-scaling parallel efficiency (see here)
- Profiled the OpenQuantumTools package while annealing a quantum spin-glass Hamiltonian, identifying the Schrödinger equation solver as the bottleneck (see here)
- GPU-accelerated the solve_schrodinger bottleneck in OpenQuantumTools (see here and here)
- Benchmarked the performance of the GPU-accelerated Schrödinger equation solver, showing a speed-up from n = 8 up to n = 10 qubits (see the bottom of this page)
We found this a successful first crack at HPC with Julia, but as usual, there is always room for future work. Here, our future direction is straightforward: integrate our changes into the OpenQuantumTools.jl package properly and then attempt to GPU-accelerate the open-quantum-system solvers. In other words, we only accelerated solve_schrodinger, but we also want to accelerate solve_lindblad, solve_redfield, etc.
load the Julia module (you can also add this to .bashrc like we did for MPI)
$ module load julia
Notice: QuantumAnnealingTools.jl and QTBase.jl are now open source and have been renamed OpenQuantumTools.jl and OpenQuantumBase.jl. Please follow their instructions for installation; if it fails, try a different Julia package server as in step 2 below.
- put both QuantumAnnealingTools.jl and QTBase.jl in a local directory
- change the package server
  $ export JULIA_PKG_SERVER=pkg.julialang.org
- start Julia
  $ julia --project=QuantumAnnealingTools.jl
- manually add QTBase.jl in package mode (by pressing `]`)
  (QuantumAnnealingTools) pkg> add ~/path/to/QTBase.jl
- run the unit tests (better run on compute nodes)
  julia> using Pkg; Pkg.test()
  or
  (QuantumAnnealingTools) pkg> test
Note:
- if installation fails with a message like the one below, run
  $ rm -rf ~/.julia/registries/General
  (cleans all registries) and try again from step 3.
  ERROR: failed to clone from https://github.com/JuliaNLSolvers/NLSolversBase.jl.git, error: GitError(Code:ERROR, Class:Net, failed to resolve address for github.com: Name or service not known)
- package loading and updating should run on the login node, since compute nodes do not have access to the internet.
see: https://juliaparallel.github.io/MPI.jl/stable/configuration/
- add the MPI package in Julia (on the login node)
  (@v1.4) pkg> add MPI
- build the package with the system MPI
  $ julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'
- run an MPI Julia script (on compute nodes), e.g. pi_mpi.jl (sketched below)
  $ mpiexec -n 4 julia --project pi_mpi.jl
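For reference, here is a minimal sketch of the pi calculation, assuming the standard midpoint-quadrature approach; the repo's actual pi_mpi.jl may differ in details.

```julia
# Sketch of an MPI pi calculation: approximate pi = ∫₀¹ 4/(1+x²) dx
# with a midpoint Riemann sum, splitting the points across ranks.
using MPI

function main(N::Int = 10_000_000)
    MPI.Init()
    comm   = MPI.COMM_WORLD
    rank   = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)

    dx = 1.0 / N
    local_sum = 0.0
    for i in rank:nprocs:(N - 1)   # each rank sums a strided subset
        x = (i + 0.5) * dx
        local_sum += 4.0 / (1.0 + x^2)
    end
    local_sum *= dx

    pi_est = MPI.Reduce(local_sum, +, 0, comm)  # combine partials on rank 0
    rank == 0 && println("pi ≈ ", pi_est)
    MPI.Finalize()
end

main()
```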
- add CUDA to .bashrc (or .modules)
  module load cuda
- add the CUDA package in Julia (on the login node)
  pkg> add CUDA
- test with salloc (see Using GPUs on Discovery)
  salloc --ntasks=2 --time=30:00 --gres=gpu:k40:1 --account=anakano_429
  julia> using CUDA
  julia> u0 = cu(rand(1000))
- use with DifferentialEquations (bonus if you want to try)
  - guide to follow
  - Note: change
    using OrdinaryDiffEq, CuArrays, LinearAlgebra
    to
    using DifferentialEquations, CUDA, LinearAlgebra
    to use packages consistent with QuantumAnnealingTools
  - Note 2: you can use the @time macro with and without the CUDA array cu() to show that CUDA speeds up the solution of the ODE; a sketch is given below
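To illustrate Note 2, here is a minimal sketch (our own example, not from the guide) that solves the same linear ODE with a plain array and with a cu() array; the matrix size and time span are arbitrary choices.

```julia
# Time the same ODE solve with and without a CUDA array.
using DifferentialEquations, CUDA, LinearAlgebra

A  = randn(Float32, 1000, 1000)
u0 = rand(Float32, 1000)
f(u, p, t)  = A * u                 # CPU right-hand side: du/dt = A*u

Ad  = cu(A)                         # move the data to the GPU
u0d = cu(u0)
fd(u, p, t) = Ad * u                # GPU right-hand side

tspan = (0.0f0, 1.0f0)
prob  = ODEProblem(f,  u0,  tspan)
probd = ODEProblem(fd, u0d, tspan)

solve(prob, Tsit5()); solve(probd, Tsit5())  # warm-up (JIT compilation)
@time solve(prob,  Tsit5())                  # CPU timing
@time solve(probd, Tsit5())                  # GPU timing
```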
In this portion, we assume you have a local version of QuantumAnnealingTools located in some directory that we'll call "localQAT." We'll walk through a specific example of adding a function called "solve_schrodinger_gpu" in localQAT/src/QSolver/closed_system_solvers.jl
- add solve_schrodinger_gpu to closed_system_solvers.jl

```julia
function solve_schrodinger_gpu(A::Annealing, tf::Real; tspan = (0, tf), kwargs...)
    u0 = cu(build_u0(A.u0, :v))   # the GPU-specific change: a cu() wrapper
    p = ODEParams(A.H, float(tf), A.annealing_parameter)
    update_func = function (C, u, p, t)
        update_cache!(C, p.L, p, p(t))
    end
    cache = get_cache(A.H)
    diff_op = DiffEqArrayOperator(cache, update_func = update_func)
    jac_cache = similar(cache)
    jac_op = DiffEqArrayOperator(jac_cache, update_func = update_func)
    ff = ODEFunction(diff_op, jac_prototype = jac_op)
    prob = ODEProblem{true}(ff, u0, float.(tspan), p)
    solve(prob; alg_hints = [:nonstiff], kwargs...)
end
```

The essential difference from the original solve_schrodinger is wrapping the initial state in cu(), which moves the state vector to the GPU so the solver's array operations run there.
- add the CUDA dependency and expose solve_schrodinger_gpu to the user
  Open the file localQAT/src/QuantumAnnealingTools.jl and add the following lines:
  import CUDA: cu
  and
  export solve_schrodinger_gpu
  where the export can be added to the existing list at the end, alongside "solve_unitary," "solve_schrodinger," etc.
- create a new Julia environment (we don't think it's possible to "override" the local QuantumAnnealingTools using the
  `free`
  command, so it's easier to just make a new development environment)
  `julia --project=QATest`
  - then add the necessary libraries on the login node, such as
    pkg> add DifferentialEquations
    pkg> add ~/path/to/QTBase.jl
    pkg> add CUDA
  - finally, add localQAT in "dev" mode
    (QATest) pkg> dev ~/path/to/localQAT
- verify that solve_schrodinger_gpu is accessible
  julia> solve_schrodinger_gpu
  which should output the following if the function is loaded:
  solve_schrodinger_gpu (generic function with 1 method)
  Note: if it doesn't work, try restarting Julia and loading again. You may have tried to load QuantumAnnealingTools before making the changes locally.
- if you want to actually run solve_schrodinger_gpu, you'll need a node with a GPU, so check out CUDA for Julia above; a usage sketch follows.
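As a quick smoke test, something like the following should work on a GPU node; the single-qubit Hamiltonian and initial state here are illustrative placeholders following the package's documented API, not the spin-glass problem we benchmark below.

```julia
# Hypothetical smoke test for solve_schrodinger_gpu (run on a GPU node).
using QuantumAnnealingTools, DifferentialEquations

H  = DenseHamiltonian([(s) -> 1 - s, (s) -> s], [σx, σz])  # H(s) = (1-s)σx + sσz
u0 = PauliVec[1][1]                # ground state of σx
annealing = Annealing(H, u0)
tf = 10.0
sol = solve_schrodinger_gpu(annealing, tf; alg = Tsit5(), reltol = 1e-6)
```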
Parallel efficiency
See the detailed data HERE (including CPU info).
Notice: with a single processor, no MPI inter-processor communication happens, so the run takes less time than expected; that data point is omitted from the plot.
Scaling with tf and n. As we can see, GPU acceleration gains an advantage when the input scale is large (n >= 8). However, it does not show an obvious advantage when the time sequence (tf) is longer.
The test code can be found here: accelqat/cuda/scaling_test.jl; its structure is outlined below.
The test results and relevant CPU information can be found here: accelqat/cuda/scaling_test_result/
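Schematically, the scaling test loops over problem sizes and times the CPU and GPU solvers; `build_annealing` below is a hypothetical stand-in for however the n-qubit spin-glass Annealing object is constructed in the actual script.

```julia
# Schematic outline of the scaling test. `build_annealing(n)` is a
# hypothetical placeholder for constructing the n-qubit spin-glass problem.
using QuantumAnnealingTools, DifferentialEquations

tf = 10.0
for n in 4:10
    annealing = build_annealing(n)   # hypothetical constructor
    t_cpu = @elapsed solve_schrodinger(annealing, tf; alg = Tsit5())
    t_gpu = @elapsed solve_schrodinger_gpu(annealing, tf; alg = Tsit5())
    println("n = $n: CPU $(round(t_cpu, digits = 3)) s, GPU $(round(t_gpu, digits = 3)) s")
end
```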