# Scaling computations using parallel computing

## Przemysław Szufel

<a class="anchor" id="toc"></a>
## Table of contents
  

1. [Multithreading](#multithreading)
2. [Green threading](#green)
3. [Multi-processing and distributed computing](#multiprocessing)
4. [(background material) Parallelize via Single Instruction Multiple Data (SIMD)](#simd)

Before starting the Jupyter notebook, set the number of threads that Julia will use.
This must be done *before* running the `notebook()` command.
The number of threads can also be set in the Julia options in Visual Studio Code (if it is used to run this notebook).
```
# run this code from Julia console just before starting Jupyter Notebook
ENV["JULIA_NUM_THREADS"]=4
```

In [1]:
println("Number of threads available to this Julia session: $(Threads.nthreads())")

Number of threads available to this Julia session: 4


In [2]:
using BenchmarkTools, Distributed

<a class="anchor" id="multithreading"></a>
### Multithreading
---- [Return to table of contents](#toc) ---


In [3]:
Threads.nthreads()

4

In [4]:
function ssum(x)
    r, c = size(x)
    y = zeros(c)
    for i in 1:c
        for j in 1:r
            @inbounds y[i] += x[j, i]
        end
    end
    y
end

ssum (generic function with 1 method)

In [5]:
function tsum(x)
    r, c = size(x)
    y = zeros(c)
    Threads.@threads for i in 1:c
        for j in 1:r
            @inbounds y[i] += x[j, i]
        end
    end
    y
end


tsum (generic function with 1 method)

In [6]:
x = rand(1000,10000);

In [7]:
@time ssum(x)
@time ssum(x);

  0.031636 seconds (12.63 k allocations: 955.258 KiB, 58.72% compilation time)
  0.012530 seconds (2 allocations: 78.172 KiB)


In [8]:
@time tsum(x)
@time tsum(x);

  0.113835 seconds (44.38 k allocations: 3.116 MiB, 163.19% compilation time)
  0.004611 seconds (25 allocations: 80.969 KiB)


#### Locking mechanism for threads

In [9]:
function f_bad()
    x = 0
    Threads.@threads for i in 1:10^6
        x += 1
    end
    return x
end


f_bad (generic function with 1 method)

In [10]:
f_bad()

250390

The result is wrong because `x += 1` is a non-atomic read-modify-write: several threads read the same old value of `x` and overwrite each other's updates (a data race).

In [11]:
function f_add()
    x = 0 
    for i in 1:10^6
        x += 1
    end
    x
end
@btime f_add()
    

  1.400 ns (0 allocations: 0 bytes)


1000000

In [12]:
function f_atomic()
    x = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:10^6
        Threads.atomic_add!(x, 1)
    end
    return x[]
end
f_atomic()

1000000

In [13]:
function f_spin()
    l = Threads.SpinLock()
    x = 0
    Threads.@threads for i in 1:10^6
        Threads.lock(l) do
            x += 1
        end
    end
    return x
end

function f_reentrant()
    l = ReentrantLock()
    x = 0
    Threads.@threads for i in 1:10^6
        Threads.lock(l) do
            x += 1
        end
    end
    return x
end


f_reentrant (generic function with 1 method)

In [14]:
using DataFrames
stats = DataFrame()
for f in [f_bad, f_atomic, f_spin, f_reentrant]
    for i = 1:2
        value, elapsedtime  = @timed f()
        push!(stats, (f=string(f),i=i, value=value, timems=elapsedtime*1000))
    end
end
println(stats)


8×4 DataFrame
 Row │ f            i      value    timems
     │ String       Int64  Int64    Float64
─────┼───────────────────────────────────────
   1 │ f_bad            1   250787   42.907
   2 │ f_bad            2   251553   46.4362
   3 │ f_atomic         1  1000000   17.8701
   4 │ f_atomic         2  1000000   16.6961
   5 │ f_spin           1  1000000  388.594
   6 │ f_spin           2  1000000  333.608
   7 │ f_reentrant      1  1000000  448.225
   8 │ f_reentrant      2  1000000  402.902
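
The timings above suggest that synchronizing on every iteration is costly. A common alternative, sketched here under the assumption that the work can be split into independent chunks, is to let each task accumulate a private partial result and combine the partial results only at the end. The function name `f_chunked` and the `ntasks` keyword are illustrative and not part of the original notebook.

```julia
# Hypothetical lock-free variant: each task counts its own chunk,
# so no state is shared while the loops run.
function f_chunked(n = 10^6; ntasks = Threads.nthreads())
    # split 1:n into at most `ntasks` contiguous ranges
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            local_x = 0            # private to this task
            for _ in chunk
                local_x += 1
            end
            local_x
        end
    end
    # fetch waits for each task and returns its partial count
    return sum(fetch, tasks)
end

f_chunked()
```

Because the only shared step is the final `sum`, this approach needs neither atomics nor locks. It could be added to the benchmark loop above for comparison.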


<a class="anchor" id="green"></a>
### Green threading 
---- [Return to table of contents](#toc) ---


In [15]:
@time sleep(2)

  2.011078 seconds (44 allocations: 1.297 KiB)


In [16]:
@time t = @async sleep(4)

  0.011966 seconds (2.80 k allocations: 201.922 KiB, 99.21% compilation time)


Task (runnable) @0x000001d30fca4f90

In [17]:
t

Task (runnable) @0x000001d30fca4f90

In [18]:
function dojob(i)
    val = round(rand(), digits=2)
    sleep(val)   # this could be external computations with I/O
    i, val
end

dojob (generic function with 1 method)

In [19]:
result = Vector{Tuple{Int,Float64}}(undef, 8)

8-element Vector{Tuple{Int64, Float64}}:
 (2319971979218726962, 5.07178135319088e-86)
 (8101253776241144114, 1.8159689052051526e-152)
 (11172911606132, 0.0)
 (0, 0.0)
 (0, 0.0)
 (0, 0.0)
 (0, 0.0)
 (0, 0.0)

In [20]:
@time for i=1:8
    result[i] = dojob(i)
end
result

  2.884699 seconds (641 allocations: 41.672 KiB, 0.40% compilation time)


8-element Vector{Tuple{Int64, Float64}}:
 (1, 1.0)
 (2, 0.04)
 (3, 0.31)
 (4, 0.1)
 (5, 0.69)
 (6, 0.08)
 (7, 0.06)
 (8, 0.51)

In [21]:
result = Vector{Tuple{Int,Float64}}(undef, 8);
@time for i=1:8
   @async result[i] = dojob(i)
end
result

  0.000082 seconds (81 allocations: 7.055 KiB)


8-element Vector{Tuple{Int64, Float64}}:
 (2, 5.0e-324)
 (1, 0.0)
 (2, 5.0e-324)
 (1, 5.0e-324)
 (0, 0.0)
 (2, 5.0e-324)
 (1, 0.0)
 (0, 0.0)

The loop returns almost immediately because `@async` only schedules the tasks. When `result` is displayed, the tasks have not finished yet, so the vector still holds uninitialized memory.

In [22]:
result = Vector{Tuple{Int,Float64}}(undef, 8);
@time @sync for i=1:8
   @async result[i] = dojob(i)
end
result

  0.968239 seconds (1.81 k allocations: 97.492 KiB, 3.53% compilation time)


8-element Vector{Tuple{Int64, Float64}}:
 (1, 0.91)
 (2, 0.54)
 (3, 0.81)
 (4, 0.84)
 (5, 0.92)
 (6, 0.54)
 (7, 0.81)
 (8, 0.6)
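
The `@sync`/`@async` pattern above writes into a preallocated vector. As a sketch of an alternative, `asyncmap` from Base runs the function on each element in concurrent tasks and returns the results in input order. `dojob` is repeated here only to keep the block self-contained.

```julia
# same job as above: sleep for a random time, then return the index and the delay
function dojob(i)
    val = round(rand(), digits=2)
    sleep(val)   # stands in for I/O-bound work
    i, val
end

# asyncmap schedules the calls as concurrent tasks and waits for all of them
result = asyncmap(dojob, 1:8)
```

The total time should be close to the longest single `sleep` rather than the sum of all of them, as in the `@sync` version.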

<a class="anchor" id="multiprocessing"></a>
### Multi-processing and distributed computing
---- [Return to table of contents](#toc) ---


In [23]:
using Distributed

The code below adds worker processes until there are 4 of them; re-running the cell does not add more.

In [24]:
addprocs(max(0, 5-nprocs()));

In [25]:
workers()

4-element Vector{Int64}:
 2
 3
 4
 5

In [26]:
function s_rand()
    n = 10^4
    x = 0.0
    for i in 1:n
        x += sum(rand(10^4))
    end
    x / n
end
 
@time s_rand()
@time s_rand()


  0.633230 seconds (20.00 k allocations: 763.397 MiB, 14.63% gc time)
  0.547106 seconds (20.00 k allocations: 763.397 MiB, 14.88% gc time)


4999.595373024474

In [27]:
using Distributed
 
function p_rand()
    n = 10^4
    x = @distributed (+) for i in 1:n
        # the last line will be aggregated
        sum(rand(10^4))
    end
    x / n
end

@time p_rand()
@time p_rand()


  1.500320 seconds (467.74 k allocations: 31.482 MiB, 41.57% compilation time)
  0.178763 seconds (426 allocations: 24.844 KiB)


5000.075405919275
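
`@distributed` partitions the range among the workers up front, which suits many cheap iterations. As a sketch of an alternative for fewer, heavier work items, `pmap` from the `Distributed` standard library hands items dynamically to whichever worker is free. The number of items (100) is an arbitrary choice for this illustration.

```julia
using Distributed

# each call is one independent work item; pmap assigns items to free workers
# (if no workers have been added, everything runs on the master process)
partial = pmap(i -> sum(rand(10^4)), 1:100)

# average of the partial sums, expected to be close to 5000
sum(partial) / length(partial)
```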

In [28]:
workers()'

1×4 adjoint(::Vector{Int64}) with eltype Int64:
 2  3  4  5

In [29]:
fetch(@spawnat 3 4+3)

7

In [30]:
function myf() 
    println("I am on worker ", myid())
    rand()
end
myf()

I am on worker 1


0.8264494355420668

In [31]:
a = nothing
try 
    fetch(@spawnat 4 myf())
catch e
    println(e)
end

RemoteException(4, CapturedException(UndefVarError(Symbol("#myf")), Any[(deserialize_datatype at Serialization.jl:1399, 1), (handle_deserialize at Serialization.jl:867, 1), (deserialize at Serialization.jl:814, 1), (handle_deserialize at Serialization.jl:874, 1), (deserialize at Serialization.jl:814 [inlined], 1), (deserialize_global_from_main at clusterserialize.jl:160, 1), (#5 at clusterserialize.jl:72 [inlined], 1), (foreach at abstractarray.jl:3094, 1), (deserialize at clusterserialize.jl:72, 1), (handle_deserialize at Serialization.jl:960, 1), (deserialize at Serialization.jl:814, 1), (handle_deserialize at Serialization.jl:871, 1), (deserialize at Serialization.jl:814, 1), (handle_deserialize at Serialization.jl:874, 1), (deserialize at Serialization.jl:814 [inlined], 1), (deserialize_msg at messages.jl:87, 1), (#invokelatest#2 at essentials.jl:887 [inlined], 1), (invokelatest at essentials.jl:884 [inlined], 1), (message_handler_loop at process_messages.jl:176, 1), (process_tcp_s

The call fails because `myf` was defined only on the master process (id 1), so worker 4 does not know this function. Defining it with `@everywhere` makes it available on all processes.

In [32]:
@everywhere function myf() 
    println("I am on worker ", myid())
    rand()
end
fetch(@spawnat 4 myf())

      From worker 4:	I am on worker 4


0.8436694643610311

#### A typical pattern for setting an initial state across workers

In [33]:
using Distributed
@everywhere using Pkg
@everywhere Pkg.activate(".")
@everywhere using Distributed, Random, DataFrames

@everywhere function calc(x, y)
    2x + y
end

@everywhere function init_worker()
    Random.seed!(myid())
    # reading initial data from files or other actions
end

@sync for wid in workers()
    @async fetch(@spawnat wid init_worker())
end


      From worker 2:	  Activating new project at `C:\AAABIBLIOTEKA\MIT_Boston\2024_MIT_18.S097_Introduction-to-Julia-for-Data-Science\Day-4b_Scaling-computations-using-parallel-computing`
      From worker 5:	  Activating new project at `C:\AAABIBLIOTEKA\MIT_Boston\2024_MIT_18.S097_Introduction-to-Julia-for-Data-Science\Day-4b_Scaling-computations-using-parallel-computing`
      From worker 3:	  Activating new project at `C:\AAABIBLIOTEKA\MIT_Boston\2024_MIT_18.S097_Introduction-to-Julia-for-Data-Science\Day-4b_Scaling-computations-using-parallel-computing`
      From worker 4:	  Activating new project at `C:\AAABIBLIOTEKA\MIT_Boston\2024_MIT_18.S097_Introduction-to-Julia-for-Data-Science\Day-4b_Scaling-computations-using-parallel-computing`


  Activating new project at `C:\AAABIBLIOTEKA\MIT_Boston\2024_MIT_18.S097_Introduction-to-Julia-for-Data-Science\Day-4b_Scaling-computations-using-parallel-computing`


Typically, results are collected into a `DataFrame`.

In [34]:
data = @distributed (append!) for (i, j) = vec(collect(Iterators.product(1:4, 1:3)))
    a = rand(1:499)
    b = rand(1:9)*1000
    c = calc(a, b)
    DataFrame(;i,j,a,b,c,procid = myid())
end

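| Row | i | j | a | b | c | procid |
|-----|-------|-------|-------|-------|-------|-------|
|     | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 441 | 5000 | 5882 | 2 |
| 2 | 2 | 1 | 378 | 7000 | 7756 | 2 |
| 3 | 3 | 1 | 476 | 3000 | 3952 | 2 |
| 4 | 4 | 1 | 263 | 4000 | 4526 | 3 |
| 5 | 1 | 2 | 444 | 6000 | 6888 | 3 |
| 6 | 2 | 2 | 269 | 7000 | 7538 | 3 |
| 7 | 3 | 2 | 25 | 7000 | 7050 | 4 |
| 8 | 4 | 2 | 195 | 9000 | 9390 | 4 |
| 9 | 1 | 3 | 151 | 9000 | 9302 | 4 |
| 10 | 2 | 3 | 384 | 9000 | 9768 | 5 |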


<a class="anchor" id="simd"></a>
### (background material) Parallelize via Single Instruction Multiple Data (SIMD)
---- [Return to table of contents](#toc) ---



In [35]:
function dot1(x, y)
    s = 0.0
    for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

dot1 (generic function with 1 method)

In [36]:
function dot2(x, y)
    s = 0.0
    @simd for i in 1:length(x)
        @inbounds s += x[i]*y[i]
    end
    s
end

dot2 (generic function with 1 method)

In [37]:
x = 100*rand(10000)
y = 100*rand(10000);

@btime dot1($x, $y)
@btime dot2($x, $y)

  4.000 μs (0 allocations: 0 bytes)
  760.000 ns (0 allocations: 0 bytes)


2.479319049399765e7

In [38]:
res1 =  dot1(x, y)

2.4793190493997578e7

In [39]:
res2 =  dot2(x, y)

2.479319049399765e7

In [40]:
res1 == res2

false

In [41]:
@show res1 
@show res2

res1 = 2.4793190493997578e7
res2 = 2.479319049399765e7


2.479319049399765e7
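
The two results differ in the last digits because `@simd` allows the compiler to reorder the additions, and floating-point addition is not associative. A minimal illustration follows, with values chosen here only for the example, together with the approximate comparison `≈` (`isapprox`), which is usually preferable to `==` for such results:

```julia
# floating-point addition is not associative: grouping changes the rounding
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

a == b     # false
a ≈ b      # true: isapprox compares with a relative tolerance
```

In the same way, `res1 ≈ res2` is the appropriate check for the two dot products above.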