# Parallelisation

So far we have only been concerned with **sequential computation**, i.e. executing one instruction after the next. Nowadays, however, even the cheapest personal computers feature multiple processor cores (in some cases even multiple physical processors). An efficient implementation therefore needs to perform computation in parallel to **make use of the power of multiple processor cores** at once.

Before we discuss some options for parallel programming in Julia, a few general remarks are in order.



## Why is parallelisation needed?

About 20 years ago it became clear that **physical constraints** on CPU manufacture and operation would make it impossible to keep increasing performance by focusing on single cores. This marked the advent of multi-core CPUs.

<img src="img/42-years-processor-trend.png" width=700px>

(Image from https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/)


## Why is parallelisation hard?

Generally, writing good parallel code is much harder than writing good sequential code. One indication of why this is the case is **Amdahl's law**:

- If a fraction $f$ of a code's runtime can be parallelised, the maximal theoretical speedup from employing $n$ cores is given by
$$ F(n) = \frac{1}{1 - f + \frac{f}{n}} $$
or graphically:

In [None]:
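using Plots

# Amdahl's law: maximal speedup on n cores if a fraction f is parallelisable
F(f, n) = 1 / (1 - f + f / n)

p = plot(; xlabel="Number of cores", ylabel="Parallel speedup", legend=:topleft)
for f in reverse((0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.99))
    # round(Int, ...) avoids an InexactError if 100f is not exactly an integer
    plot!(p, n -> F(f, n), 1:32, label="$(round(Int, 100f))%", lw=2)
end
# f = 1 is the ideal case of perfectly parallelisable code: F(1, n) = n
plot!(p, n -> F(1, n), 1:32, label="ideal", ls=:dash, color=:black)
p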

* Note that even with 95% parallelisable code the speedup is only around 9 on 16 cores and around 12.5 on 32 cores, and it can never exceed $1/(1-f) = 20$, no matter how many cores are used.
* Generally **doubling** the number of cores leads to **far less than twice** the speedup.
* Getting a code to **scale to more than 10 processors** requires **serious work** ... and thus more involved code.
* More involved code means that it is more difficult to debug, switch algorithmic ideas etc.
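
To put numbers on these remarks, the following minimal sketch evaluates Amdahl's law for a code that is 95% parallelisable. The names `amdahl` and `amdahl_limit` are chosen here purely for illustration; the limit $1/(1-f)$ follows from letting $n \to \infty$ in the formula above.

```julia
# Amdahl's law: speedup on n cores if a fraction f of the runtime is parallelisable
amdahl(f, n) = 1 / (1 - f + f / n)

# Upper bound for n → ∞: only the serial fraction 1 - f remains
amdahl_limit(f) = 1 / (1 - f)

f = 0.95
for n in (2, 4, 8, 16, 32, 1024)
    println("n = ", lpad(n, 4), ":  speedup ≈ ", round(amdahl(f, n); digits=2))
end
println("limit for n → ∞: ", round(amdahl_limit(f); digits=2))
```

Going from 16 to 32 cores raises the speedup only from roughly 9.1 to roughly 12.5, and even 1024 cores stay below the bound of 20.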

**Therefore: At the center of an efficient parallel code is a fast serial code.**

## Types of parallelisation

We can distinguish multiple forms of parallelism, among them:

* **Instruction level parallelism** (e.g. SIMD)
* **Multi-threading**  (shared memory, same process)
* **Multi-processing** (shared memory, different process)
* **Distributed parallelism** (distributed memory and processes, e.g. across nodes)

In the same order, some ways to do [parallel computing](https://docs.julialang.org/en/v1/manual/parallel-computing/) in Julia are:

* `@simd`, [SIMD.jl](https://github.com/eschnett/SIMD.jl), [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl)
* `Threads.@threads`, `Threads.@spawn`, [FLoops.jl](https://github.com/JuliaFolds/FLoops.jl), [ThreadsX.jl](https://github.com/tkf/ThreadsX.jl) ...
* `@spawnat`, `@fetch`, `RemoteChannel`, `SharedArray`
* `@spawnat`, `@fetch`, `RemoteChannel`, [DistributedArrays.jl](https://github.com/JuliaParallel/DistributedArrays.jl), [MPI.jl](https://github.com/JuliaParallel/MPI.jl)
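
As a rough illustration of the first two entries, the sketch below sums a vector once with `@simd` and once with `Threads.@threads`, using only functionality from Base. The function names `sum_simd` and `sum_threaded` and the keyword `nchunks` are made up for this example, and the code assumes a non-empty vector. It is one possible way to write a threaded reduction, not a tuned implementation.

```julia
using Base.Threads: @threads, nthreads

# Instruction-level parallelism: @simd allows the compiler to reorder the
# floating-point additions so that it can use vector instructions.
function sum_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Multi-threading: split the indices into chunks and compute one partial
# sum per chunk. Each iteration writes to its own slot of `partial`, so
# the threads never update the same memory location.
function sum_threaded(x; nchunks=nthreads())
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nchunks)))
    partial = zeros(eltype(x), length(chunks))
    @threads for c in eachindex(chunks)
        partial[c] = sum_simd(view(x, chunks[c]))
    end
    return sum(partial)
end

x = rand(10^6)
sum_simd(x) ≈ sum(x) && sum_threaded(x) ≈ sum(x)
```

The threaded version only uses more than one core if Julia was started with several threads, for example via `julia --threads=4`. Results are compared with `≈` because changing the summation order changes the floating-point rounding.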

**In this course:**
  * We already explored instruction-level parallelism as part of [16_Performance_Engineering.ipynb](16_Performance_Engineering.ipynb)
  * We will discuss threaded parallelism in [21_Multithreading_Basics.ipynb](21_Multithreading_Basics.ipynb) and MPI-based distributed parallelism in [22_MPI_Distributed_Parallelism.ipynb](22_MPI_Distributed_Parallelism.ipynb).