# Simple Parallelization Exercise

In [1]:
addprocs(4)

4-element Array{Int64,1}:
 2
 3
 4
 5

### Distributing work across processes
I follow the official Julia documentation on [parallel computing](http://docs.julialang.org/en/latest/manual/parallel-computing/).

In [2]:
# This function returns the (irange, jrange) indexes assigned to this worker
@everywhere function myrange(q::SharedArray)
    idx = indexpids(q)
    if idx == 0
        # This worker is not assigned a piece
        return 1:0, 1:0
    end
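    nchunks = length(procs(q))
    # Evenly spaced column boundaries from 0 to size(q,2), rounded to integers
    splits = [round(Int, s) for s in linspace(0,size(q,2),nchunks+1)]
    # All rows, plus this worker's contiguous block of columns
    1:size(q,1), splits[idx]+1:splits[idx+1]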
end
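As a quick check of the splitting arithmetic, the sketch below reproduces the column boundaries without a `SharedArray` or any worker processes. `column_chunks` is a hypothetical helper, not part of the notebook, and it computes the evenly spaced boundaries directly instead of calling `linspace`; that substitution is an assumption made so the snippet relies only on basic arithmetic.

```julia
# Hypothetical helper (not in the notebook): mirrors the column split in myrange.
function column_chunks(ncols, nchunks)
    # Boundaries 0, ncols/nchunks, ..., ncols, each rounded to an integer
    splits = [round(Int, k * ncols / nchunks) for k in 0:nchunks]
    # Chunk k covers columns splits[k]+1 through splits[k+1]
    [(splits[k] + 1):splits[k + 1] for k in 1:nchunks]
end

chunks = column_chunks(10000, 4)
println(chunks)
```

For 10000 columns and 4 chunks this gives `1:2500`, `2501:5000`, `5001:7500`, and `7501:10000`, one contiguous block per worker.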

### Adding two matrices

In [7]:
# Here's the kernel
@everywhere function advection_chunk!(q, u, irange, jrange)
    # @show (irange, jrange)  # display so we can see what's happening
    for j in jrange, i in irange
        q[i,j] = q[i,j] + u[i,j]
    end
    q
end


# Here's a convenience wrapper for a SharedArray implementation
@everywhere advection_shared_chunk!(q, u) = advection_chunk!(q, u, myrange(q)...)

### Serial code (without parallelization)

In [8]:
advection_serial!(q, u) = advection_chunk!(q, u, 1:size(q,1), 1:size(q,2))

advection_serial! (generic function with 1 method)

### Parallel code

In [14]:
function advection_shared!(q, u)
    @sync begin
        for p in procs(q)
            @time @async remotecall_wait(advection_shared_chunk!, p, q, u)
        end
    end
    q
end

advection_shared! (generic function with 1 method)

In [6]:
q = SharedArray(Float64, (5,10000))
u = SharedArray(Float64, (5,10000))
advection_serial!(q,u);
advection_shared!(q,u);

(irange,jrange) = (1:5,1:10000)
	From worker 3:	(irange,jrange) = (1:5,2501:5000)
	From worker 2:	(irange,jrange) = (1:5,1:2500)
	From worker 5:	(irange,jrange) = (1:5,7501:10000)
	From worker 4:	(irange,jrange) = (1:5,5001:7500)


In [11]:
@time advection_serial!(q,u);

  0.000243 seconds (4 allocations: 160 bytes)


In [13]:
@time advection_shared!(q,u);

  0.001272 seconds (2.59 k allocations: 198.359 KB)


### The serial code is faster than the parallel code. What is going on?

It seems that parallel execution carries a lot of overhead. Unless each worker does a fairly significant amount of work, parallelization yields no speedup.

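To see whether this is the case, I time the work dispatched to each worker. Subtracting these times from the total gives the time attributable to overhead. Note that `@time` wraps `@async` here, so each per-worker figure covers only the creation of the task, not the remote computation itself.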

In [18]:
@time advection_shared!(q,u);

  0.000005 seconds (6 allocations: 928 bytes)
  0.000004 seconds (5 allocations: 880 bytes)
  0.000002 seconds (5 allocations: 880 bytes)
  0.000002 seconds (5 allocations: 880 bytes)
  0.003274 seconds (3.25 k allocations: 229.750 KB)
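The per-task figures above are only a few microseconds because `@time` sits outside `@async`: the clock stops as soon as the task has been scheduled, before the remote call finishes. The sketch below reproduces the effect with local tasks only. `sleep` stands in for the remote work, and `compare_timings` is a hypothetical name, not part of the notebook.

```julia
# Hypothetical demo: timing task creation versus timing the task's work.
function compare_timings(work_seconds)
    # Clock around @async: measures only creating and scheduling the task
    local task
    t_schedule = @elapsed (task = @async sleep(work_seconds))
    wait(task)

    # Clock inside the task: measures the work itself
    t_work = Ref(0.0)
    @sync begin
        @async (t_work[] = @elapsed sleep(work_seconds))
    end

    t_schedule, t_work[]
end

t_schedule, t_work = compare_timings(0.2)
println("scheduling: ", t_schedule, " s, work: ", t_work, " s")
```

In `advection_shared!`, moving the timer inside the task (for example `@async @time remotecall_wait(...)`) would time each remote call, including the communication with the worker.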


The table below shows timings for different numbers of workers and grid points. The ratio is the parallel time divided by the serial time, so values below 1 mean the parallel version is faster.

| Workers |  Grid points  | Serial time (s) | Parallel time (s) | Ratio (parallel / serial) |
|---------|---------------|-----------------|-------------------|---------------------------|
|   3     |     10000     |    0.000224     |     0.001189      |           5.30            |
|         |    100000     |    0.001433     |     0.001778      |           1.24            |
|         |    200000     |    0.003179     |     0.002934      |           0.92            |
|   5     |     10000     |    0.000228     |     0.001595      |           6.99            |
|         |    100000     |    0.001636     |     0.002169      |           1.32            |
|         |    200000     |    0.003357     |     0.002605      |           0.77            |
|   7     |     10000     |    0.000188     |     0.002354      |          12.52            |
|         |    100000     |    0.001403     |     0.002680      |           1.91            |
|         |    200000     |    0.003055     |     0.004268      |           1.39            |
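
As a rough reading of the table, one can assume that the computation itself scales ideally with the number of workers and attribute the rest of the parallel time to overhead. This model is an assumption for illustration, not something measured in the notebook. With $p$ workers, serial time $T_s$ and parallel time $T_p$, the implied overhead is

```latex
$$
T_{\mathrm{ov}} = T_p - \frac{T_s}{p}
$$

$$
\begin{aligned}
p = 3,\ 10000 \text{ points:} \quad & T_{\mathrm{ov}} \approx 0.001189 - \frac{0.000224}{3} \approx 0.00111\ \mathrm{s} \\
p = 3,\ 200000 \text{ points:} \quad & T_{\mathrm{ov}} \approx 0.002934 - \frac{0.003179}{3} \approx 0.00187\ \mathrm{s}
\end{aligned}
$$
```

Under this assumption the overhead for 3 workers is on the order of a millisecond, which fits the observation that the parallel version only breaks even once the serial time reaches a few milliseconds.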
