# Running Native Julia Parallelism
## The Basics
### Jordan Jalving

### University of Wisconsin-Madison


## Starting a Parallel Session
There are two ways to start using parallel processing in Julia:  
1. Start your Julia session with `~$ julia -p <n_workers>`
2. Run `addprocs(<n_workers>)` in a session

Since we are using an IJulia notebook, we will use the second option here to add our worker processes.


In [2]:
#Add processors to your Julia session
addprocs(3) #Here, we add 3 workers
println("Processors in session = ",nprocs())

Processors in session = 4


In [3]:
#Get the number of worker processes.  The process running the Julia session is not a worker
n_workers = nworkers() 
println("The number of worker processors is: $n_workers")

The number of worker processors is: 2


In [4]:
#Get array of worker processors
worker_ids = workers()
println("The worker ids are: ", worker_ids)

The worker ids are: [2, 3]
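
The cells above were run on a pre-1.0 Julia, where `addprocs` and friends were available by default. As a minimal sketch, assuming Julia 1.x, the same bookkeeping needs the `Distributed` standard library loaded first. The worker count of 2 is an arbitrary choice for illustration.

```julia
using Distributed  # Julia 1.x: the parallel primitives live in this standard library

# Add two local workers only if the session has none yet (2 is an arbitrary choice)
nprocs() == 1 && addprocs(2)

println("This process has id ", myid())   # the process that started the session has id 1
println("Total processes:  ", nprocs())
println("Worker processes: ", nworkers())
println("Worker ids:       ", workers())
```

Once workers exist, `nprocs()` is always `nworkers() + 1`, because process 1 coordinates the others and is not counted as a worker.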


 --- 

## Distributed Parallelism: Remote Calls and Futures

Julia uses the concept of remote calls and futures to manage processing and communication. Think of it as if
each processor has its own memory domain and can send and request data from other processors.

We will showcase the basic functions using a simple 3x3 matrix

In [5]:
#This is how to make a random 3x3 matrix in Julia.  This creates the matrix on the local process.
rand(3,3)

3×3 Array{Float64,2}:
 0.907714  0.133574  0.360118
 0.401509  0.727763  0.982123
 0.101566  0.487098  0.179558

### Use `remotecall` to evaluate a function on a given process
`remotecall(<function>,<process_id>,[arguments])` is the low-level function to request work to be done on a given process.
For instance, we can ask processor 2 to create a random 3x3 matrix.

In [6]:
#Call the rand function on worker 2 with arguments (3,3)
#This creates a Future r on the local process
r = remotecall(rand, 2, 3, 3)

Future(2, 1, 4, Nullable{Any}())

r is a Future (i.e. a remote reference).  Processor 1 (our julia session) has a reference to the result which will be available on processor 2.

Use `fetch(r)` to get the value processor 2 calculated.  **NOTE:  This will move the data from processor 2 to processor 1 (overhead).**

In [7]:
#Get the value of r.  The local process will wait until the 
#value is ready (i.e. fetch is blocking)
fetch(r)

3×3 Array{Float64,2}:
 0.154911  0.654919  0.982367
 0.819902  0.502235  0.15779 
 0.24031   0.453198  0.62019 
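
A deterministic variant makes the life cycle of a `Future` easier to follow. This is a sketch assuming Julia 1.x with the `Distributed` standard library; the choice of `+` as the remote function is only for illustration.

```julia
using Distributed
nprocs() == 1 && addprocs(2)     # make sure at least one worker exists

w = first(workers())             # any worker id will do
r = remotecall(+, w, 1, 2)       # returns a Future immediately, without blocking
wait(r)                          # block until the worker has finished the call
ready = isready(r)               # true: the value can now be fetched without waiting
value = fetch(r)                 # moves the result (3) to the calling process
println("ready = $ready, value = $value")
```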

--- 

### Use `@spawnat` to evaluate an expression on a given process
`@spawnat` is a macro, which means we can pass any valid expression as an argument.  It is the macro counterpart of `remotecall(<function>,<process_id>,[arguments])` and has the form `@spawnat <process_id> <expression>`.
___

In [9]:
#run rand(3,3) command on processor 3
s1 = @spawnat 3 rand(3,3)

Future(3, 1, 6, Nullable{Any}())

In [10]:
#fetch the value the same way
fetch(s1)

3×3 Array{Float64,2}:
 0.990488  0.034919  0.537269
 0.622238  0.408424  0.659486
 0.48038   0.372676  0.30252 
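
To confirm that the expression really is evaluated on the chosen process, we can spawn `myid()` itself. A small sketch, assuming Julia 1.x with `Distributed` loaded:

```julia
using Distributed
nprocs() == 1 && addprocs(2)

w = last(workers())
where_it_ran = fetch(@spawnat w myid())    # myid() is evaluated on worker w, not locally
total = fetch(@spawnat w sum(abs2, 1:10))  # any expression works: 1^2 + 2^2 + ... + 10^2
println("ran on process $where_it_ran, total = $total")
```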

#### NOTE: `fetch(<Remote Reference>)` can be used to move data from worker to worker

In [11]:
#Call rand(3,3) on processor 2
r = remotecall(rand, 2, 3, 3) 

#Spawn the operation `1 + fetch(r)` on process 3 by passing the remote reference r
s2 = @spawnat 3 1 + fetch(r) 

Future(3, 1, 9, Nullable{Any}())

In [13]:
#fetch the value from processor 2
println(fetch(r))
#Fetch the value from processor 3
println(fetch(s2))

[0.966091 0.488924 0.355472; 0.111911 0.528139 0.350022; 0.657613 0.0244478 0.128135]
[1.96609 1.48892 1.35547; 1.11191 1.52814 1.35002; 1.65761 1.02445 1.12813]
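
The same worker-to-worker transfer can be checked with fixed values. This sketch assumes Julia 1.x, where adding a scalar to a matrix requires the broadcast operator `.+` instead of the plain `+` used above.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

w1, w2 = workers()[1], workers()[end]
r = remotecall(fill, w1, 1.0, 2, 2)   # a 2x2 matrix of ones is created on w1
s = @spawnat w2 1 .+ fetch(r)         # w2 fetches the matrix from w1 and adds 1
result = fetch(s)                     # only now does data move to the calling process
println(result)
```

The Future `r` is only a reference, so sending it to `w2` is cheap; the matrix itself travels when `fetch(r)` runs on `w2`.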


--- 

### Use `remotecall_fetch`  to obtain a remotely-computed value immediately

#### NOTE: This is equivalent to `fetch(remotecall())`, but it is more efficient because the call and the fetch happen in a single message.  This function is typically used inside loops to collect values.

In [15]:
#run rand on process 2 with arguments (3,3).  Fetch the result immediately.
remotecall_fetch(rand, 2, 3, 3)

3×3 Array{Float64,2}:
 0.527964  0.377173   0.795317 
 0.472128  0.0793794  0.0377924
 0.953235  0.69017    0.92849  

In [14]:
#remotecall_fetch is the same as fetch(remotecall()), but usually more efficient
fetch(remotecall(rand,2,3,3))

3×3 Array{Float64,2}:
 0.743321  0.0782893  0.569942
 0.735892  0.930334   0.756292
 0.449399  0.445753   0.683127
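
As a sketch of the loop usage mentioned above (assuming Julia 1.x with `Distributed`), we can ask every worker for its own id:

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# Each call blocks until the worker answers, so the workers are queried one at a time
ids = [remotecall_fetch(myid, p) for p in workers()]
println(ids)
```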

--- 

### Use `@spawn` to make things easier.
`@spawn` is like `remotecall` and `@spawnat`, but Julia picks the process to use.

In [16]:
#Spawn the task rand(3,3).  Let Julia pick where to run it
s = @spawn rand(3,3)

Future(2, 1, 15, Nullable{Any}())

In [17]:
#Fetch the result the same way
fetch(s)

3×3 Array{Float64,2}:
 0.651879  0.159476   0.342553
 0.122161  0.0575064  0.757701
 0.550802  0.442897   0.892071
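
On Julia 1.3 and later, the `@spawn` macro from `Distributed` is deprecated in favor of `@spawnat :any`, which does the same job. A minimal sketch under that assumption:

```julia
using Distributed
nprocs() == 1 && addprocs(2)

f = @spawnat :any myid()   # let Julia choose the process
chosen = fetch(f)          # id of the process that ran the expression
println("Julia chose process $chosen")
```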

---
---

## Asynchronous Parallelism: Using @sync and @async to manage tasks
So far, we have shown how to use `remotecall`, `@spawnat`, and `@spawn` to dispatch work to other processes.  In any useful parallel application, however, we also want to manage communication between our workers.  This can be done using the `@async` and `@sync` macros.

### Use @sync and @async to manage asynchronous computing
`@async <expression>`: Create a task on the master process that runs the expression.  Julia will continue without waiting for `@async` to finish.

`@sync <expression>`: Wait for enclosed uses of `@async`, `@spawn`, and `@spawnat` to finish.  Typically these two macros get used in the form:

```julia
@sync begin 
    for i = 1:n_workers
        @async begin #start a @async task on the Julia process
            #...
            #dispatch work to worker i (e.g. using remotecall_fetch())
            #...
        end
    end  #end for
end  #wait for all the @async tasks to finish
 ```

### @sync @async?  What's the difference?
These two macros often get confused.  Let's look at a couple of examples that make their use clearer, using
some simple `sleep()` calls.

In [54]:
#Sleep for 2 seconds
@time sleep(2)

  2.001065 seconds (78 allocations: 1.453 KiB)


#### Asynchronous `sleep()`.  Remember, Julia will not wait for `@async` to complete

In [18]:
@time @async sleep(2)

Task (runnable) @0x00007fb733e5e770

  0.004119 seconds (371 allocations: 24.182 KiB)


In [19]:
#@sync will wait for the enclosed use of @async to finish
@time @sync @async sleep(2)

Task (done) @0x00007fb7340a7730

  2.009708 seconds (288 allocations: 18.438 KiB)
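
The same effect shows up with several tasks at once. In this sketch the three one-second sleeps overlap, so the whole block takes about one second instead of three; the task count and sleep length are arbitrary.

```julia
# Time a @sync block that launches three overlapping one-second sleeps
elapsed = @elapsed @sync begin
    for i in 1:3
        @async sleep(1)   # each task yields while sleeping, so the others can start
    end
end
println("three sleeps took $(round(elapsed, digits=2)) seconds")
```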


### The wrong way to schedule workers.  How long will this code take?  Why?

In [22]:
@time begin
    for (idx, pid) in enumerate(workers())
        println("Sending work to $pid")
        remotecall_fetch(sleep,pid, 2)
    end
end

Sending work to 2
Sending work to 3
  4.007086 seconds (1.66 k allocations: 93.094 KiB)
nothing


### The correct way to schedule.  Use @sync and @async

In [23]:
@time begin
    @sync for (idx, pid) in enumerate(workers())
        println("sending work to $pid")
        @async remotecall_fetch(sleep,pid, 2)
    end
end

sending work to 2
sending work to 3
  2.008865 seconds (3.55 k allocations: 204.437 KiB)


### What if we forget the @sync?

In [24]:
@time begin
    for (idx, pid) in enumerate(workers())
        println("sending work to $pid")
        @async remotecall_fetch(sleep,pid, 2)
    end
end

sending work to 2
sending work to 3
  0.005044 seconds (353 allocations: 22.461 KiB)
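
Without `@sync` the loop returns before the workers have finished, and the results are lost. One alternative, sketched here for Julia 1.x, is to keep the `Task` objects that `@async` returns and `fetch` them when the results are needed.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# Start one task per worker and keep a handle to each of them
tasks = map(workers()) do pid
    @async remotecall_fetch(myid, pid)
end

results = fetch.(tasks)   # blocks until every task has finished
println(results)
```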


--- 
---

## A more interesting example: `pmap`
**Now that we ~~have mastered~~ hopefully understand the basics of Julia's parallel functions, pmap may seem less mysterious.  Here, we have a slightly modified version of the pmap function.**

In [15]:
function my_pmap(f, lst)  #apply the function f to arguments in lst
    np = nprocs()         # determine the number of processes available
    n = length(lst)       #lst is a list of n arguments.
    results = Vector{Any}(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    nextidx() = (idx=i; i+=1; idx)
    @sync begin
        #Create a feeder task (using @async) for each worker
        for p = workers()
            @async begin  
                while true #feeder task runs constantly
                    idx = nextidx()
                    if idx > n
                        break
                    end
                    println("Sent argument $idx to worker $p")
                    results[idx] = remotecall_fetch(f, p, lst[idx])
                end
            end
        end
    end
    results
end

my_pmap (generic function with 1 method)
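
For comparison, the built-in `pmap` has the same call shape as `my_pmap`. A minimal sketch, assuming Julia 1.x with `Distributed` (on Julia 1.x, `my_pmap` itself would also need `Vector{Any}(undef, n)` in place of `Vector{Any}(n)`):

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# pmap sends one element at a time to whichever worker is free
squares = pmap(x -> x^2, 1:5)
println(squares)
```

The results come back in the order of the inputs, regardless of which worker handled each element.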

#### Let's look at what happens when we want to do singular value decomposition on a list of matrices

In [16]:
@time begin 
    M = [rand(1000,1000) for _ = 1:20];
    my_pmap(svd,M);
end;

Sent argument 1 to worker 2
Sent argument 2 to worker 3
Sent argument 3 to worker 4
Sent argument 4 to worker 3
Sent argument 5 to worker 2
Sent argument 6 to worker 4
Sent argument 7 to worker 3
Sent argument 8 to worker 4
Sent argument 9 to worker 2
Sent argument 10 to worker 3
Sent argument 11 to worker 4
Sent argument 12 to worker 2
Sent argument 13 to worker 3
Sent argument 14 to worker 4
Sent argument 15 to worker 2
Sent argument 16 to worker 3
Sent argument 17 to worker 4
Sent argument 18 to worker 2
Sent argument 19 to worker 3
Sent argument 20 to worker 4
 18.937517 seconds (41.46 k allocations: 459.801 MiB, 1.57% gc time)


### How does the serial version compare?

In [23]:
@time begin 
    M = [rand(1000,1000) for _ = 1:20]
    map(svd,M)
end;

 20.294191 seconds (10.00 k allocations: 1.195 GiB, 5.08% gc time)


### Use `@everywhere` to define functions and data on every process

In [25]:
#calculate the SVD of a random square matrix
@everywhere function calc_svd(i::Int)
    M = rand(i,i)
    return svd(M)
end

In [29]:
(U,E,V) = calc_svd(1000)


([-0.0313593 0.00778317 … 0.0520004 -0.00405403; -0.0319202 0.0210451 … 0.0177618 -0.0270165; … ; -0.0318672 0.00370634 … -0.0355034 -0.0115457; -0.0320372 -0.0460958 … 0.0297155 -0.00947919], [500.47, 18.205, 18.0379, 17.9666, 17.9175, 17.8067, 17.7276, 17.6885, 17.668, 17.5635  …  0.142454, 0.131418, 0.114749, 0.0991385, 0.0732386, 0.0599413, 0.0483915, 0.0235702, 0.0200018, 0.00157804], [-0.0314898 -0.00541011 … 0.0225108 0.0073545; -0.0314878 0.0344167 … 0.007495 0.0148601; … ; -0.0318607 -0.0217017 … 0.00383985 0.0394869; -0.0308898 0.0213347 … 0.00100465 0.047732])

**Here, we use a custom function so that each matrix is created on the worker that factorizes it, instead of being created on the master process and sent over.  However, because the SVD itself dominates the run time, this is unlikely to produce much speedup.**

In [27]:
@time my_pmap(calc_svd,[1000 for _ = 1:20]);

Sent argument 1 to worker 2
Sent argument 2 to worker 3
Sent argument 3 to worker 4
Sent argument 4 to worker 2
Sent argument 5 to worker 3
Sent argument 6 to worker 4
Sent argument 7 to worker 3
Sent argument 8 to worker 2
Sent argument 9 to worker 4
Sent argument 10 to worker 3
Sent argument 11 to worker 4
Sent argument 12 to worker 2
Sent argument 13 to worker 3
Sent argument 14 to worker 4
Sent argument 15 to worker 2
Sent argument 16 to worker 2
Sent argument 17 to worker 4
Sent argument 18 to worker 3
Sent argument 19 to worker 2
Sent argument 20 to worker 4
 18.700734 seconds (39.08 k allocations: 307.003 MiB, 5.25% gc time)
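
On Julia 1.x, `svd` lives in the `LinearAlgebra` standard library, which must be loaded on every process as well. The sketch below assumes that setup; `calc_singular_values` is a hypothetical helper that returns only the singular values, and the matrix size of 50 is chosen to keep the run short.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

@everywhere using LinearAlgebra        # load the library on the master and all workers

# Hypothetical helper: build a random n-by-n matrix on the worker and factorize it there
@everywhere function calc_singular_values(n::Int)
    M = rand(n, n)
    return svdvals(M)                  # only the singular values travel back
end

vals = pmap(calc_singular_values, fill(50, 4))
println(length(vals), " results, each with ", length(vals[1]), " singular values")
```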


---
---

## Another example: Calculating Pi using Monte Carlo Sampling

![](MonteCarlo.png)

**We can numerically estimate $\pi$ by randomly generating points inside a unit square and counting how many land inside the quadrant of the unit circle.**

$A_{square} = 1$

$A_{circle} = \frac{\pi}{4}$

$\pi = 4 * \frac{A_{circle}}{A_{square}}$

For a point with independent, uniformly distributed random coordinates

$x \sim U(0,1)$

$y \sim U(0,1)$

the point lies inside the unit circle if

$x^2 + y^2 \le 1$

We can then estimate pi as the ratio of points inside the circle over total points generated:

$ \pi \approx 4 * \frac{\text{N points inside circle}}{\text{N total}}$
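
As a rough guide to how many points are needed (a standard binomial argument, added here as an aside): each point lands inside the circle with probability $p = \frac{\pi}{4}$, so the number of points inside is binomial and the estimate has standard error

$\text{SE}(\hat{\pi}) = 4 \sqrt{\frac{p(1-p)}{N}} \approx \frac{1.64}{\sqrt{N}}$

For $N = 10^9$ this gives roughly $5 \times 10^{-5}$, so we should expect about four correct decimal places.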

### A reasonable serial implementation for Julia
#### Here, we loop through 1 billion points and check if each is in the unit circle

In [10]:
tic()
function in_circle(numPoints)
    N_inside = 0           #initially, no points in circle
    for i = 1:numPoints
        x = rand()
        y = rand()
        x^2 + y^2 < 1 ? N_inside += 1 : N_inside += 0
    end
    return N_inside
end
numPoints = 1E9
pi_approx = 4 * in_circle(numPoints) / numPoints
toc()
println("approximation = ",pi_approx)

elapsed time: 9.613884557 seconds
approximation = 3.141588104


## Exercise: Write a parallel code to estimate $\pi$

In [None]:
#
function in_circle(numPoints)
    N_inside = 0
    for i = 1:numPoints
        x = rand()
        y = rand()
        x^2 + y^2 < 1 ? N_inside += 1 : N_inside += 0
    end
    return N_inside
end
numPoints = 1E9
pi_approx = 4 * in_circle(numPoints) / numPoints

#One hint: This can be done with partitions
#e.g. partitions = [convert(Int64,round(numPoints / 3)) for i = 1:3]

---
---
---
---
---

## One Solution: Partitioned pmap implementation
#### With `pmap`, we can partition the points into 3 subsets and have each worker run its own loop

In [1]:
tic()
@everywhere function in_circle(numPoints)
    N_inside = 0
    for i = 1:numPoints
        x = rand()
        y = rand()
        x^2 + y^2 < 1 ? N_inside += 1 : N_inside += 0
    end
    return N_inside
end

numPoints = 1E9
partitions = [convert(Int64,round(numPoints / 3)) for i = 1:3]
N_inside = pmap(in_circle, partitions)
pi_approx = 4*sum(N_inside) / numPoints
toc()
println(pi_approx)

elapsed time: 9.854903545 seconds
3.14155876
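
One detail of the partitioning above: three rounded thirds of 1E9 add up to 999,999,999 points, one fewer than the `numPoints` used in the final division. The effect is negligible here, but a partition that sums exactly to the total is easy to build. `split_count` below is a hypothetical helper, not part of Julia.

```julia
# Hypothetical helper: split `total` into `parts` integer chunks that sum exactly to `total`
function split_count(total::Integer, parts::Integer)
    base, extra = divrem(total, parts)
    # the first `extra` chunks receive one additional point each
    return [base + (i <= extra ? 1 : 0) for i in 1:parts]
end

partitions = split_count(10^9, 3)
println(partitions, " sums to ", sum(partitions))
```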


## Another solution: `@parallel for` (this was replaced with `@distributed for` in Julia v1.0)
#### NOTE: `@parallel (<reduction>) for` splits the iteration range among the workers up front

In [101]:
tic()
numPoints = 1E9
inside = @parallel (+) for i = 1:numPoints
    x = rand()
    y = rand()
    Int(x^2 + y^2 < 1)
end
toc()
println(4 * inside / numPoints)

elapsed time: 3.258154472 seconds
3.14155292
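
For readers on Julia 1.x, a sketch of the same reduction with `@distributed` follows. It assumes the `Distributed` standard library, uses `@elapsed` because `tic()` and `toc()` were removed in Julia 1.0, and uses only 10^6 points to keep the run short.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

num_points = 10^6
elapsed = @elapsed begin
    # Each worker sums the hits over its share of the range; (+) combines the partial sums
    inside = @distributed (+) for i = 1:num_points
        x = rand()
        y = rand()
        Int(x^2 + y^2 < 1)
    end
end
pi_approx = 4 * inside / num_points
println("approximation = $pi_approx ($(round(elapsed, digits=2)) seconds)")
```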


## Be careful with `@parallel` when modifying variables

In [3]:
a = zeros(10)
@parallel for i = 1:10
    a[i] = i
end;
println(a)

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


**NOTE: What happened here is that each worker process received its own copy of `a` and modified only that copy.  As a result, the vector `a` on the master process did not change.**

## Use a `SharedArray` to share values between all processes

In [4]:
a = SharedArray{Float64}(10)
@sync @parallel for i = 1:10  #@sync waits until every worker has finished writing
    a[i] = i
end;
println(a)

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
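
On Julia 1.x, `SharedArray` lives in the `SharedArrays` standard library and the loop macro is `@distributed`. A sketch under those assumptions, with all workers on the same machine as the master process:

```julia
using Distributed
nprocs() == 1 && addprocs(2)

@everywhere using SharedArrays   # every process needs the SharedArray type

a = SharedArray{Float64}(10)     # backed by shared memory that all local processes can see
# Without a reduction, @distributed returns immediately; @sync waits for the workers
@sync @distributed for i = 1:10
    a[i] = i
end
println(a)
```

Each worker writes to a disjoint set of indices, so no locking is needed in this case.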
