## Add available processors

Julia starts with a single process: the main process, which has id 1.

In [None]:
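ENV["LINES"] = 10               # limit the number of rows shown in printed output
ENV["JULIA_NUM_THREADS"] = 4    # read at startup: applies to workers added later, not to this running process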
using Distributed
nprocs()

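To avoid over-allocating workers, we first check that only one process is running and only then add workers. Called without a count, as here, `addprocs` starts one worker per logical CPU core, so the number of workers depends on the machine.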

In [None]:
nprocs() == 1 && addprocs(;exeflags="--project")
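
If you want a fixed number of workers rather than one per core, you can pass the count explicitly. The following is a minimal sketch that assumes the machine can start three local worker processes; `nworkers()` reports the number of workers, excluding the main process.

```julia
using Distributed

# Add exactly three local workers, but only if none have been added yet.
nprocs() == 1 && addprocs(3; exeflags="--project")

# Number of workers, excluding the main process (id 1).
nworkers()
```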

Let's check which workers we have. `workers()` lists the worker ids and, once workers have been added, does not include the main Julia process (id 1).

In [None]:
workers()

## Remote Calls and References

We use the `@spawn` macro to run a task on whichever remote worker is available. (Since Julia 1.3 the `Distributed` documentation recommends the equivalent `@spawnat :any`.)

In [None]:
ref = @spawn sin(π/5)

The return type is a `Future` because the value only becomes available later, once the worker has finished and the result has travelled back over the network. `@spawn` is a non-blocking call: the main Julia process does not wait for the task to finish. If you want Julia to wait, you can wrap the call in `@sync`, but waiting for each remote call before starting other independent work defeats the purpose of parallelization.

To get the value at a later time, use `fetch`. This is a blocking call, so Julia waits until the value is available.

In [None]:
fetch(ref)
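
To see the difference between the non-blocking and blocking calls, the sketch below spawns a task that deliberately takes about one second (the `sleep` is only an illustrative stand-in for real work). It uses `isready`, which only reports whether the result has arrived, and `wait`, which blocks without returning the value.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# Start a task that takes about one second on some worker.
ref = @spawn begin
    sleep(1)
    sin(π/5)
end

# `isready` never blocks; it only reports whether the result has arrived.
println("ready right after spawning? ", isready(ref))

# `wait` blocks until the result is available.
wait(ref)
println("ready after wait? ", isready(ref))

# `fetch` returns the value; the task has already finished at this point.
value = fetch(ref)
```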

Here is an example of a loop with remote calls. To keep the discussion simple we use `sin` as the task, but the tasks best suited to remote calls are those that take long enough to benefit from running in parallel in the background.

We first declare an `Array` of `Future`s, which the loop then fills with the `Future` returned by each `@spawn`.

In [None]:
n=10
refs = Array{Future,1}(undef,n)
for i = 1:n
    refs[i] = @spawn sin(i)
end

`refs` now contains an `Array` of `Future`s.

In [None]:
refs

We can map `fetch` over the `Future`s and aggregate (reduce) the results by summing them.

In [None]:
reduce(+,map(fetch,refs))
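
Because `sin` returns almost instantly, the example above does not show any benefit from running remotely. The sketch below swaps in a hypothetical `slow_square` function (not part of the original notebook) that sleeps for half a second, so the serial and remote versions can be timed against each other. The timings depend on the machine and the number of workers.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for an expensive computation: 0.5 s of "work".
@everywhere function slow_square(i)
    sleep(0.5)
    return i^2
end

# Serial: each call finishes before the next one starts.
serial = @time [slow_square(i) for i in 1:4]

# Remote: the comprehension launches all four calls first, and only then
# does `map(fetch, ...)` collect the results, so the waits can overlap.
parallel = @time map(fetch, [@spawn slow_square(i) for i in 1:4])
```

Both versions produce the same values; only the elapsed time reported by `@time` should differ.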

A more elegant way to do this is to use `@distributed`, as shown below.

In [None]:
res=@distributed (vcat) for i=1:n
    sin(i)
end

In [None]:
reduce(+,res)

You can also replace `(vcat)` with `(+)` in `@distributed`, so that the loop returns the sum directly.

In [None]:
res = @distributed (+) for i in 1:10
    println("processing: ",i)
    sin(i)
end

In [None]:
# Spawn all tasks, fetch each result, then sum them.
[@spawn sin(i) for i in 1:10] .|> fetch |> sum

Here is another example, which concatenates one-row `DataFrame`s, each recording a worker id and the result of the task that worker ran.

You will notice a pattern in the task assignment: `@distributed` partitions the range among the workers, so each worker handles a consecutive block of iterations rather than picking up tasks one at a time.

In [None]:
@everywhere using DataFrames
res=@distributed (vcat) for i = 1:10
    println((i,sin(i)))
    DataFrame(worker=myid(), vals=sin(i))
end
res
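
To inspect the assignment pattern without `DataFrames`, you can record the worker id for each index directly. This is a minimal sketch; the variable name `assignments` is introduced here for illustration. Since `@distributed` partitions the range among the workers, one would expect each worker to report a consecutive block of indices.

```julia
using Distributed
nprocs() == 1 && addprocs(2)

# Record which worker handled each index. Wrapping the pair in a one-element
# vector lets `vcat` build a flat Vector of (index, worker id) tuples.
assignments = @distributed (vcat) for i in 1:10
    [(i, myid())]
end

# Group the indices by the worker that processed them.
for w in workers()
    println("worker ", w, " handled ", [i for (i, id) in assignments if id == w])
end
```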

## Channel

In [None]:
function producer(c::Channel)     
    for n=1:10
       put!(c, n*n)
    end
end

In [None]:
task = Channel(producer)

In [None]:
take!(task)

In [None]:
for tsk in Channel(producer)
    @spawn println("received task: ",tsk)
end

In [None]:
[ @spawn tsk for tsk in Channel(producer)] .|> fetch |> sum

## Multi-threading

In [None]:
@everywhere using Base.Threads

In [None]:
nthreads()

In [None]:
@threads for i = 1:10
    id = threadid() 
    println("threadid: ",id)
end

In [None]:
@sync @distributed for i=1:nprocs()
    @threads for j=1:nthreads()
        id = Threads.threadid() 
        println("threadid: ",id)
    end
end

## Monte-Carlo Simulation to estimate $\pi$

In [None]:
#==========================#
# Monte Carlo simulation
# π r^2 / (4 r^2) ≈ s/n, so π ≈ 4s/n
#==========================#


@everywhere function isInside() 
    x = rand()
    y = rand()
    x^2 + y^2 < 1 ? 1 : 0
end;

@everywhere function ppi(n)
    s=@distributed (+) for i = 1:n
        isInside()
    end
    4s/n
end;

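# Serial version. Named `spi` rather than `pi` so that it does not
# shadow Base's constant `pi` in this session.
function spi(n)
    s=0.0
    for i = 1:n
        s+=isInside()
    end
    4s/n
end;


In [None]:
@time ppi(10^9)

In [None]:
@time spi(10^9)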
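
The estimate in the code comment above follows from a ratio of areas. As a short derivation (the symbols $p$ and $\hat{\pi}$ are introduced here for illustration): `isInside` draws $x$ and $y$ uniformly from $[0, 1)$, so the point lands inside the quarter of the unit circle with probability

$$
p = P(x^2 + y^2 < 1) = \frac{\text{area of quarter circle}}{\text{area of unit square}} = \frac{\pi/4}{1} = \frac{\pi}{4}.
$$

With $s$ hits out of $n$ samples, $s/n$ estimates $p$, which gives

$$
\hat{\pi} = \frac{4s}{n}.
$$

Since $s$ is binomial with parameters $n$ and $p$, the standard deviation of the estimate is

$$
\sigma_{\hat{\pi}} = 4\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{\pi(4-\pi)}{n}} \approx \frac{1.64}{\sqrt{n}},
$$

so for $n = 10^9$ the estimate is typically accurate to about $5 \times 10^{-5}$, regardless of whether the serial or the distributed version is used.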

## Cross-validation in parallel

In [None]:
@everywhere using RDatasets
@everywhere using Statistics
@everywhere using DecisionTree
@everywhere using Random

@everywhere function partitionTrainTest(data, at = 0.7)
    n = nrow(data)
    idx = shuffle(1:n)
    train_idx = view(idx, 1:floor(Int, at*n))
    test_idx = view(idx, (floor(Int, at*n)+1):n)
    return (data[train_idx,:], data[test_idx,:])
end


@everywhere function irisAcc() 
    iris = dataset("datasets", "iris")
    train,test = partitionTrainTest(iris, 0.7) # 70% train
    xtrain = train[:, 1:4] |>Matrix;
    ytrain = train[:, 5] |> Vector{String}
    xtest = test[:, 1:4] |>Matrix;
    ytest = test[:, 5] |> Vector{String}
    model = build_forest(ytrain, xtrain, 2, 4, 0.5, 6);
    pred = apply_forest(model,xtest);
    sum(ytest .== pred) / length(pred)
end

In [None]:
irisAcc()

In [None]:
function mserial(n)
    sm=0.0
    for i=1:n
         sm += irisAcc()
    end
    return sm/n*100.0
end
@time mserial(10000)

In [None]:
function mparallel(n)
    s=@distributed (+) for i=1:n
        irisAcc()
    end
    return s/n*100.0
end
@time  mparallel(10000)