## Add available processors

The Julia interpreter starts with a single process (the main Julia process, id 1).

In [1]:
ENV["LINES"] = 10
ENV["JULIA_NUM_THREADS"] = 4
using Distributed
nprocs()

1

To avoid over-allocating workers, we first check that only the main process is running and only then add workers. Called without a count, `addprocs` starts one worker per available core, which is 8 on this machine.

In [2]:
nprocs() == 1 && addprocs(; exeflags="--project")

8-element Vector{Int64}:
 2
 3
 4
 ⋮
 8
 9
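The guard above can be generalized to top the pool up to a chosen size. The following is a minimal sketch; the helper name `ensure_workers` and the target of 3 are hypothetical choices for illustration and are not part of the original notebook.

```julia
using Distributed

# Hypothetical helper: add only as many workers as are missing to reach `target`.
function ensure_workers(target::Integer)
    # nworkers() reports 1 when only the main process exists, so count via nprocs().
    current = nprocs() - 1
    if current < target
        addprocs(target - current; exeflags="--project")
    end
    return workers()
end

ensure_workers(3)   # calling it a second time adds nothing
```

Counting with `nprocs() - 1` rather than `nworkers()` keeps the arithmetic right in a fresh session, where the main process is reported as the only worker.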

Let's check the worker ids. The main Julia process (id 1) is not counted as a worker.

In [3]:
workers()

8-element Vector{Int64}:
 2
 3
 4
 ⋮
 8
 9

## Remote Calls and References

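We use the `@spawn` macro to dynamically assign a task to an available remote worker. (Since Julia 1.3 the `Distributed` documentation recommends `@spawnat :any` in place of `@spawn`.)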

In [4]:
ref = @spawn sin(π/5)

Future(2, 1, 10, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)

The return type is a `Future` because the value may only become available later, for example because of network or communication delays. This is a non-blocking call: the main Julia process does not wait for the task to finish. You can make Julia wait with `@sync`, but waiting for the remote call before starting other independent work defeats the purpose of parallelization.

To get the value at a later time, use `fetch`. This is a blocking call, so Julia waits until the value is available.

In [5]:
fetch(ref)

0.5877852522924731
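The difference between the non-blocking launch and the blocking `fetch` can be made visible by timing both. This is a sketch under assumptions: `sleep(1)` stands in for a slow task, the variable names are illustrative, and `@spawnat :any` is used as the current spelling of `@spawn`.

```julia
using Distributed

t0 = time()
fut = @spawnat :any (sleep(1); sin(π/5))   # sleep(1) stands in for slow work
launch_seconds = time() - t0               # time spent before control returned to the caller

value = fetch(fut)                         # blocks until the task has finished
total_seconds = time() - t0                # includes the one-second sleep

println("returned after ", round(launch_seconds; digits=3),
        " s; value ready after ", round(total_seconds; digits=3), " s")
```

Any independent work placed between the launch and the `fetch` runs while the remote task is still sleeping.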

Here's an example of a loop with remote calls. To keep the discussion simple we use `sin` as the task, but the tasks best suited to remote calls are long-running ones, so that they can run in parallel in the background.

We first declare an `Array` of `Future`s, which `@spawn` then populates inside the loop.

In [6]:
n=10
refs = Array{Future,1}(undef,n)
for i = 1:n
    refs[i] = @spawn sin(i)
end

`refs` now contains an array of `Future`s.

In [7]:
refs

10-element Vector{Future}:
 Future(3, 1, 12, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(4, 1, 13, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(5, 1, 14, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 ⋮
 Future(3, 1, 20, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
 Future(4, 1, 21, ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0), nothing)
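To see why long-running tasks are the better fit, the same pattern can be tried with a task that takes noticeable time. This is a minimal sketch: `slow_square` is a hypothetical function whose `sleep` stands in for real work, and `@spawnat :any` is used as the current spelling of `@spawn`.

```julia
using Distributed

# Hypothetical slow task; the sleep stands in for real work.
@everywhere function slow_square(x)
    sleep(0.5)
    return x^2
end

refs = [@spawnat :any slow_square(i) for i in 1:4]   # returns without waiting
ready_at_launch = count(isready, refs)               # typically 0: nothing has finished yet

foreach(wait, refs)                                  # block until every task is done
ready_after_wait = count(isready, refs)              # 4
```

Because all four tasks are launched before any of them is waited on, their sleeps overlap instead of adding up.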

We can map `fetch` over the `Future`s and aggregate (reduce) the results by summing them.

In [8]:
reduce(+,map(fetch,refs))

1.4111883712180104
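The map-then-reduce step can also be written as a single call. A short sketch, rebuilding `refs` with a comprehension so that the block is self-contained (again using `@spawnat :any` in place of `@spawn`):

```julia
using Distributed

n = 10
refs = [@spawnat :any sin(i) for i in 1:n]

# sum(fetch, refs) fetches each Future and adds the values
# without building the intermediate array that map would create.
total = sum(fetch, refs)
```

`mapreduce(fetch, +, refs)` is an equivalent spelling when the reducer is something other than addition.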

A more elegant way to do this is to use `@distributed`, as shown below.

In [9]:
res=@distributed (vcat) for i=1:n
    sin(i)
end

10-element Vector{Float64}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
  ⋮
  0.4121184852417566
 -0.5440211108893698

In [10]:
reduce(+,res)

1.4111883712180104

You can also replace `(vcat)` with `(+)` in `@distributed`.

In [11]:
res = @distributed (+) for i in 1:10
    println("processing: ",i)
    sin(i)
end

      From worker 3:	processing: 3
      From worker 5:	processing: 6


1.4111883712180104

      From worker 2:	processing: 1
      From worker 2:	processing: 2
      From worker 4:	processing: 5
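The reducer also determines whether `@distributed` waits. With a reducer the call blocks and returns the reduced value. Without one, the `Distributed` documentation describes the loop as running asynchronously, so it has to be prefixed with `@sync` to wait for completion. A small sketch of both forms, with illustrative variable names:

```julia
using Distributed

# With a reducer the macro blocks and returns the reduced value.
total = @distributed (+) for i in 1:10
    sin(i)
end

# Without a reducer it returns a Task right away; @sync waits for the loop to finish.
task = @sync @distributed for i in 1:10
    sin(i)
end
```

The second form is only useful for its side effects, since the values computed in the loop body are discarded.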


In [12]:
[@spawn sin(i) for i in 1:10]  .|> fetch   |> sum

      From worker 6:	processing: 7
      From worker 9:	processing: 10
      From worker 8:	processing: 9
      From worker 7:	processing: 8
      From worker 3:	processing: 4


1.4111883712180104

Here's another example, which concatenates one `DataFrame` per task, each holding the worker id and the corresponding task result.

Notice the pattern of task assignment: `@distributed` splits the range into contiguous chunks, one per worker, so here workers 2 and 3 each handle two consecutive values and the remaining workers handle one each.

In [13]:
@everywhere using DataFrames
res=@distributed (vcat) for i = 1:10
    println((i,sin(i)))
    DataFrame(worker=myid(), vals=sin(i))
end
res

      From worker 7:	(8, 0.9893582466233818)
      From worker 6:	(7, 0.6569865987187891)
      From worker 8:	(9, 0.4121184852417566)
      From worker 9:	(10, -0.5440211108893698)
      From worker 4:	(5, -0.9589242746631385)
      From worker 5:	(6, -0.27941549819892586)
      From worker 2:	(1, 0.8414709848078965)
      From worker 3:	(3, 0.1411200080598672)
      From worker 2:	(2, 0.9092974268256817)
      From worker 3:	(4, -0.7568024953079282)


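| Row | worker (Int64) | vals (Float64) |
|-----|----------------|----------------|
| 1   | 2              | 0.841471       |
| 2   | 2              | 0.909297       |
| 3   | 3              | 0.14112        |
| 4   | 3              | -0.756802      |
| 5   | 4              | -0.958924      |
| 6   | 5              | -0.279415      |
| 7   | 6              | 0.656987       |
| 8   | 7              | 0.989358       |
| 9   | 8              | 0.412118       |
| 10  | 9              | -0.544021      |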
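The assignment pattern can be inspected without `DataFrames` by returning the task number together with `myid()` and grouping by worker. This is a sketch; the names `assignment` and `by_worker` are illustrative.

```julia
using Distributed

# Each iteration returns a one-element vector so that vcat always joins vectors.
assignment = @distributed (vcat) for i in 1:10
    [(task = i, worker = myid())]
end

# Group the task numbers by the worker that ran them.
by_worker = Dict{Int,Vector{Int}}()
for a in assignment
    push!(get!(by_worker, a.worker, Int[]), a.task)
end
by_worker
```

Each value in `by_worker` should be a run of consecutive task numbers, which is what the table above shows for workers 2 and 3.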


## Channel

In [14]:
function producer(c::Channel)     
    for n=1:10
       put!(c, n*n)
    end
end

producer (generic function with 1 method)

In [15]:
task = Channel(producer)

Channel{Any}(0) (1 item available)

In [16]:
take!(task)

1

In [17]:
for tsk in Channel(producer)
    @spawn println("received task: ",tsk)
end

      From worker 3:	received task: 25
      From worker 2:	received task: 16
      From worker 5:	received task: 49
      From worker 7:	received task: 1
      From worker 7:	received task: 81


In [18]:
[ @spawn tsk for tsk in Channel(producer)] .|> fetch |> sum

      From worker 9:	received task: 9
      From worker 6:	received task: 64
      From worker 8:	received task: 4
      From worker 8:	received task: 100
      From worker 4:	received task: 36


385

## Multi-threading

In [19]:
@everywhere using Base.Threads



In [20]:
nthreads()

4

In [21]:
@threads for i = 1:10
    id = threadid()
    println("threadid: ",id)
end

threadid: 2
threadid: 4
threadid: 4
threadid: 2
threadid: 2
threadid: 1
threadid: 1
threadid: 1
threadid: 3
threadid: 3


In [22]:
@sync @distributed for i=1:nprocs()
    @threads for j=1:nthreads()
        id = Threads.threadid()
        println("threadid: ",id)
    end
end

      From worker 4:	threadid: 2
      From worker 7:	threadid: 2
      From worker 8:	threadid: 2
      From worker 6:	threadid: 2
      From worker 5:	threadid: 2
      From worker 8:	threadid: 1
      From worker 6:	threadid: 3
      From worker 5:	threadid: 3
      From worker 7:	threadid: 4
      From worker 5:	threadid: 4
      From worker 4:	threadid: 3
      From worker 8:	threadid: 4
      From worker 6:	threadid: 4
      From worker 7:	threadid: 3
      From worker 4:	threadid: 4
      From worker 8:	threadid: 3
      From worker 2:	threadid: 2
      From worker 6:	threadid: 1
      From worker 7:	threadid: 1
      From worker 4:	threadid: 1
      From worker 5:	threadid: 1
      From worker 3:	threadid: 1
      From worker 2:	threadid: 3
      From worker 3:	threadid: 4
      From worker 2:	threadid: 4
      From worker 3:	threadid: 2
      From worker 2:	threadid: 1
      From worker 3:	threadid: 3
      From worker 3:	threadid: 1
      From worker 3:	threadid: 3
      From worker 3:	threadid: 4
      Fr

Task (done) @0x0000000111dece80

## Monte-Carlo Simulation to estimate $\pi$

In [23]:
#==========================#
# Monte Carlo simulation
# area(circle) / area(square) = π r^2 / (4 r^2) ≈ s/n, so π ≈ 4s/n
#==========================#


@everywhere function isInside() 
    x = rand()
    y = rand()
    x^2 + y^2 < 1 ? 1 : 0
end;

@everywhere function ppi(n)
    s=@distributed (+) for i = 1:n
        isInside()
    end
    4s/n
end;

# Serial version. Note that this name shadows the constant `pi` from Base in this session.
function pi(n)
    s=0.0
    for i = 1:n
        s+=isInside()
    end
    4s/n
end;


In [24]:
@time ppi(10^9)

  2.148993 seconds (72.60 k allocations: 3.829 MiB, 1.97% compilation time)


3.141583216

In [25]:
@time pi(10^9)

  8.299566 seconds


3.14163316
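The estimate and its expected accuracy follow from treating each call to `isInside` as a Bernoulli trial. With $x$ and $y$ uniform on $[0, 1)$, the point lands inside the quarter of the unit circle with probability $\pi/4$, so with $s$ hits out of $n$ draws (a standard binomial argument, not stated in the original notebook):

```latex
\Pr\left(x^2 + y^2 < 1\right) = \frac{\pi}{4},
\qquad
\hat{\pi} = \frac{4s}{n},
\qquad
\operatorname{SE}(\hat{\pi})
  = 4\sqrt{\frac{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)}{n}}
  \approx \frac{1.64}{\sqrt{n}}
```

For $n = 10^9$ the standard error is about $5.2 \times 10^{-5}$, which is consistent with the two runs above differing from $\pi$ by roughly $1 \times 10^{-5}$ and $4 \times 10^{-5}$.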

## Cross-validation in parallel

In [26]:
@everywhere using RDatasets
@everywhere using Statistics
@everywhere using DecisionTree
@everywhere using Random

@everywhere function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    return (data[train_idx,:], data[test_idx,:])
end


@everywhere function irisAcc() 
    iris = dataset("datasets", "iris")
    train,test = partitionTrainTest(iris, 0.7) # 70% train
    xtrain = train[:, 1:4] |>Matrix;
    ytrain = train[:, 5] |> Vector{String}
    xtest = test[:, 1:4] |>Matrix;
    ytest = test[:, 5] |> Vector{String}
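    # build_forest arguments: 2 features per split, 4 trees, 0.5 sampling fraction per tree, max depth 6
    model = build_forest(ytrain, xtrain, 2, 4, 0.5, 6);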
    pred = apply_forest(model,xtest);
    sum(ytest .== pred) / length(pred)
end

In [27]:
irisAcc()

0.9333333333333333

In [30]:
function mserial(n)
    sm=0.0
    for i=1:n
         sm += irisAcc()
    end
    return sm/n*100.0
end
@time mserial(10000)

  4.865286 seconds (10.81 M allocations: 1.585 GiB, 4.31% gc time)


94.63488888889106

In [31]:
function mparallel(n)
    s=@distributed (+) for i=1:n
        irisAcc()
    end
    return s/n*100.0
end
@time  mparallel(10000)

  1.711370 seconds (27.06 k allocations: 1.411 MiB, 0.49% gc time, 2.22% compilation time)


94.61222222222166