# Message Passing using MPI



Let's first start up the IPyParallel cluster so that we can use MPI in this Jupyter notebook. For this notebook, it is a good idea to start the cluster with 8 MPI ranks (check the output below). Once we have connected to the cluster, we issue `%autopx`, which means that the commands of all following code cells will be executed on the 8 parallel workers.

In [1]:
import numpy as np
import ipcmagic
import ipyparallel as ipp
%ipcluster start -n 8 --mpi
rc = ipp.Client()
rc.ids
dv = rc[:]
dv.activate()
dv.block = True
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed on the workers in parallel (disable with %autopx)")
%autopx

IPCluster is ready! (7 seconds)
Running IPython Parallel on 8 MPI engines
Commands in the following cells will be executed on the workers in parallel (disable with %autopx)
%autopx enabled


We are going to use MPI for this notebook. `mpi4py` is a Python interface to MPI. There are interfaces to MPI in almost all programming languages (Fortran, C/C++, Python, Julia, ...). The [documentation](https://mpi4py.readthedocs.io/en/stable/) is a good resource for an introduction and further information.

In [2]:
import time
import timeit
import numpy as np
import matplotlib.pyplot as plt 
from mpi4py import MPI

## Welcome to COMM_WORLD!

With the commands above we have 8 parallel processes (workers) running on our compute node. If you are familiar with the `top -u courseXX` command, you can head over to `File` &rarr; `New` &rarr; `Terminal` and display only the processes for your user: you will find 8 instances of `ipengine` running, which are the 8 workers.

MPI works with groups of these workers called communicators. Each MPI command accepts a communicator and the command will only apply to the processes in this communicator. There is a base communicator called `COMM_WORLD` which contains *all* processes that are available.

In [3]:
comm = MPI.COMM_WORLD

Since we have enabled `%autopx` above, the previous cell and all following cells are being executed on *all* ranks simultaneously and in parallel. So now there is a variable named `comm` defined on all 8 Python processes running on the node and it points to the `COMM_WORLD` communicator.

We can quickly validate that this is true, by defining and printing a variable on all ranks. JupyterHub automatically gathers the standard output from the connected ranks and prints all 8 messages as a result of the cell execution.

In [4]:
a = np.random.rand(1)
print(a)

[stdout:0] [0.07322831]
[stdout:1] [0.38493535]
[stdout:2] [0.34325771]
[stdout:3] [0.51867848]
[stdout:4] [0.92577974]
[stdout:5] [0.71155992]
[stdout:6] [0.00285285]
[stdout:7] [0.06317337]


Now we can query our communicator in order to figure out how many ranks it contains. Also, we can query the rank number of each of the processes running. MPI will automatically assign each rank a unique integer increasing from 0.

In [5]:
size = comm.Get_size()
rank = comm.Get_rank()
print("I am rank {} of a total of {} ranks.".format(rank, size))

[stdout:0] I am rank 4 of a total of 8 ranks.
[stdout:1] I am rank 3 of a total of 8 ranks.
[stdout:2] I am rank 7 of a total of 8 ranks.
[stdout:3] I am rank 6 of a total of 8 ranks.
[stdout:4] I am rank 1 of a total of 8 ranks.
[stdout:5] I am rank 0 of a total of 8 ranks.
[stdout:6] I am rank 5 of a total of 8 ranks.
[stdout:7] I am rank 2 of a total of 8 ranks.


Similar to shared memory parallelism with OpenMP, sometimes there are things that cannot be done in parallel. Good examples of this are writing data to disk or plotting the result of a computation. For this purpose, we define a *master rank* responsible for executing these parts of the code. By convention the master rank is typically rank 0, for the simple reason that rank 0 always exists, even if an MPI-parallel program is executed with a single worker.

In [6]:
# this runs in parallel
a = 1

# now execute I/O only on master rank
if rank == 0:
    a = 42
    
# this runs in parallel again
b = 1

# check values
print("I am rank {} and a = {} b = {}".format(rank, a, b))

[stdout:0] I am rank 4 and a = 1 b = 1
[stdout:1] I am rank 3 and a = 1 b = 1
[stdout:2] I am rank 7 and a = 1 b = 1
[stdout:3] I am rank 6 and a = 1 b = 1
[stdout:4] I am rank 1 and a = 1 b = 1
[stdout:5] I am rank 0 and a = 42 b = 1
[stdout:6] I am rank 5 and a = 1 b = 1
[stdout:7] I am rank 2 and a = 1 b = 1


Note that one of the main differences to the OpenMP programming model is that with MPI the default is to run things in parallel and sequential regions have to be programmed explicitly (using an `if rank == 0` statement in the example above), whereas in the OpenMP programming model parallel regions have to be indicated with directives and the default is sequential execution.

The data model is also fundamentally different to OpenMP. In OpenMP all threads could access the same variable using the `shared(variable)` clause to a parallel region. This is no longer possible in a distributed memory model. Drawing the analogy with OpenMP, essentially all variables are private to each MPI rank. This is illustrated nicely above where the random number generated by each rank or the `rank` variable stored in the memory space of each MPI rank had a different value.

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>1.</b> Create a numpy array <code>a</code> with 65'000'000 random numbers (on all ranks). Time how long it takes to compute <code>b = np.arctan(a)</code> of the array on a single rank and on all ranks. You can use the <code>timeit.default_timer()</code> method (see examples further down in this notebook) to measure the execution time. How long does it take to do the work on only 1 rank? How long does it take to do 8 times more work on 8 ranks? What would you have expected? Why?<br>
</div>

In [7]:
# TODO parallel arctan
a = np.random.rand(65000000)

def work(a):
    return np.arctan(a)

comm.barrier()

tic = timeit.default_timer()
if rank == 0:
    b = work(a)
toc = timeit.default_timer()
elapsed_time = toc - tic
if rank == 0:
    print('Elapsed time on single rank = {} s'.format(elapsed_time))

comm.barrier()
    
tic = timeit.default_timer()
b = work(a)
toc = timeit.default_timer()
elapsed_time = toc - tic
print('Elapsed time on all ranks = {} s'.format(elapsed_time))

[stdout:0] Elapsed time on all ranks = 1.8269113020505756 s
[stdout:1] Elapsed time on all ranks = 1.826852158876136 s
[stdout:2] Elapsed time on all ranks = 1.8243743521161377 s
[stdout:3] Elapsed time on all ranks = 1.8295371860731393 s
[stdout:4] Elapsed time on all ranks = 1.8354995879344642 s
[stdout:5] 
Elapsed time on single rank = 1.8239165570121258 s
Elapsed time on all ranks = 1.8433416741900146 s
[stdout:6] Elapsed time on all ranks = 1.8259539608843625 s
[stdout:7] Elapsed time on all ranks = 1.8260860070586205 s


*Solution:*

Computing `np.arctan()` is an expensive operation that involves a significant number of floating-point operations, so the runtime is dominated by the flops in this case. Since the 8 cores can work independently, the runtime is nearly independent of the number of ranks participating in the computation: 8 ranks do 8 times the work in about the same time, which is essentially an 8x speedup on 8 cores. An interesting experiment is to replace `np.arctan(a)` with a simple `a.copy()`. One can see two things: the runtime goes down significantly (meaning that the original version was not dominated by memory bandwidth), and the runtime now differs significantly between a single rank and 8 ranks.
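
The speedup claim can be checked with a little arithmetic. The sketch below is not part of the original solution; the helper name `weak_scaling` is made up for illustration, and the two timings are copied from the rank 0 output of the cell above.

```python
def weak_scaling(t_single, t_all, num_ranks):
    """Return (efficiency, speedup) when every rank processes one full array.

    t_single: time for one rank to process one array
    t_all:    time for num_ranks ranks to process one array each
    """
    efficiency = t_single / t_all      # 1.0 would be perfect scaling
    speedup = num_ranks * efficiency   # work completed per unit time, relative to one rank
    return efficiency, speedup


# Timings reported by rank 0 in the output above
efficiency, speedup = weak_scaling(1.8239165570121258, 1.8433416741900146, 8)
print("efficiency = {:.3f}, speedup = {:.2f}x".format(efficiency, speedup))
```

An efficiency close to 1 means that adding ranks (and work) barely changes the runtime, which is the behaviour expected for a compute-bound kernel.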

## Point-to-Point Communication

The main functionality that the MPI API has to provide is a means of exchanging data (messages) between the workers running in parallel. Since the workers do not share the same address space, they cannot access each other's variables.

In [8]:
if rank == 0:
    c = 42
    
if rank == 1:
    print(c)

CompositeError: one or more exceptions from call to method: execute
[4:execute]: NameError: name 'c' is not defined

So how can we send the variable `c` from rank 0 to rank 1? For this purpose, MPI provides point-to-point communication semantics. These are basically a set of methods that can be used to send information from a single rank to another single rank.

In [9]:
c = None

if rank == 0:
    c = 42
    comm.send(c, dest=1, tag=1001)
    c = "When words fail, music speaks."

if rank == 1:
    c = comm.recv(source=0, tag=1001)
    print(c)

[stdout:4] 42


It is instructive to see the values of `c` after executing the above cell. Note that the master rank changed the value of `c` to "When words fail, music speaks." *after* having sent the value to rank 1. So rank 1 does not know about the update. The symbol `c` refers to 8 different copies of the variable which can all have different values. As in the example above, they can also be of different type or size.

In [10]:
print("On rank {} c = {}".format(rank, c))

[stdout:0] On rank 4 c = None
[stdout:1] On rank 3 c = None
[stdout:2] On rank 7 c = None
[stdout:3] On rank 6 c = None
[stdout:4] On rank 1 c = 42
[stdout:5] On rank 0 c = When words fail, music speaks.
[stdout:6] On rank 5 c = None
[stdout:7] On rank 2 c = None


Note that the MPI send and receive methods allow for tagging messages with a unique ID (e.g. `tag=1001` in the example above). In the above example, the message would never have arrived if the tags had not been chosen to match.

**Note: If you try out changing the tag and reexecuting the above cell, you will need to [restart the IPyParallel Cluster](#restarting) (see Deadlock below) and then come back here.**

Tagging helps avoid errors when many messages from many senders arrive at the same rank or when several messages of different types are received from the same sender. When the receiver requests a message with a specific tag number, messages with different tags will be buffered until the receiver is ready for them. Let's make a simple example.

In [11]:
if rank == 1:
    c = 1
    comm.send(c, dest=0, tag=1001)

if rank == 2:
    c = 2
    comm.send(c, dest=0, tag=1002)

if rank == 0:
    c1 = comm.recv(source=1, tag=1001)
    c2 = comm.recv(source=2, tag=1002)
    print(c1, c2)

[stdout:5] 1 2


There is no way of knowing in which order the messages from rank 2 and rank 1 will arrive at rank 0. If we had not specified a tag, rank 0 might end up waiting indefinitely for the message from rank 1 because the message from rank 2 arrived first (deadlock).

You can try switching around the order of the receives on rank 0 and re-running. You can also try removing the tags, although this might put your MPI library in an undefined state and you might have to restart the cluster in order to start with a fresh environment (see [Restarting the IPyParallel Cluster](#restarting)).
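
The buffering of messages until a matching receive is posted can be mimicked with a toy mailbox in plain Python. This is only a sketch of the matching rule (source and tag must both agree); the `Mailbox` class is a hypothetical helper and says nothing about how an MPI library is implemented internally.

```python
class Mailbox:
    """Toy model of the message queue on a receiving rank (not part of mpi4py)."""

    def __init__(self):
        self.pending = []  # messages that have arrived but were not received yet

    def deliver(self, source, tag, payload):
        """A message arrives and is buffered until a matching receive is posted."""
        self.pending.append((source, tag, payload))

    def recv(self, source, tag):
        """Return the oldest buffered message that matches both source and tag."""
        for i, (src, t, payload) in enumerate(self.pending):
            if src == source and t == tag:
                del self.pending[i]
                return payload
        raise LookupError("no matching message; a real blocking recv would keep waiting")


mailbox = Mailbox()
# Assume the message from rank 2 happens to arrive before the one from rank 1
mailbox.deliver(source=2, tag=1002, payload=2)
mailbox.deliver(source=1, tag=1001, payload=1)

c1 = mailbox.recv(source=1, tag=1001)  # skips the buffered message from rank 2
c2 = mailbox.recv(source=2, tag=1002)
print(c1, c2)
```

Regardless of the arrival order, each receive picks out the message it asked for, which is why the cell above prints `1 2`.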

<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>2.</b> Create a program where two ranks repeatedly pass an integer back and forth a given number of times (start with 1, gradually increase to 500'000). Measure the time and compute the time per message in µs. This time is called the latency $L$.<br>
</div>

In [12]:
# TODO Ping-pong / latency
c = 42
num_iter = 500000

tic = timeit.default_timer()

for i in range(num_iter):
    if rank == 0:
        comm.send(c, dest=1, tag=1001)
        c = comm.recv(source=1, tag=1002)

    if rank == 1:
        c = comm.recv(source=0, tag=1001)
        comm.send(c, dest=0, tag=1002)

toc = timeit.default_timer()
elapsed_time = toc - tic

latency = elapsed_time / num_iter
if rank in [0, 1]:
    print("Latency on rank {} is {} µs".format(rank, latency * 1.0e6))

[stdout:4] Latency on rank 1 is 3.3983203801326454 µs
[stdout:5] Latency on rank 0 is 3.403954280074686 µs


A simple performance model for communication would compute the time for a message $T = L + s / B$, where $s$ is the message size and $B$ is the bandwidth. On a modern HPC system latency is $0.001 - 0.01 \, µs$ and bandwidths are $0.1 - 10\,\mathrm{GB}/s$ for MPI communication. Our latency is significantly higher because of the overhead Python incurs.
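
To get a feeling for the model $T = L + s / B$, it can be evaluated for a few message sizes. The numbers below are assumptions chosen for illustration: the latency is roughly the value measured in the ping-pong cell above, and the bandwidth of 10 GB/s is the upper end of the range quoted in the text.

```python
def message_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated transfer time T = L + s / B in seconds."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


latency = 3.4e-6   # seconds, roughly the value measured above (assumption)
bandwidth = 10e9   # bytes per second, assumed

for size_bytes in (8, 1024, 1024 * 1024):
    t = message_time(size_bytes, latency, bandwidth)
    print("{:>8d} bytes: {:8.2f} µs".format(size_bytes, t * 1e6))

# Message size at which the latency term and the bandwidth term are equal
crossover_bytes = latency * bandwidth
print("latency and transfer time are equal at about {:.0f} bytes".format(crossover_bytes))
```

Under these assumptions, small messages are dominated by the latency term and large messages by the bandwidth term.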

## Array data communication

Most MPI methods come in two versions, one starting with a capital letter (`comm.Send()`) and one starting with a lowercase letter (`comm.send()`). The main difference is that the latter (which we used above) can be used to send arbitrary Python objects. Internally, mpi4py uses the `pickle` package (see the [pickle documentation](https://docs.python.org/3/library/pickle.html) for more details) to serialize the object into a byte stream, which is then unpickled on the receiver side. This type of communication comes with an overhead and is rarely used in high-performance code.
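
The serialization step can be tried out without MPI. The following sketch only illustrates the `pickle` round trip and compares the size of the pickled byte stream with a raw buffer of 8-byte floats; it makes no claim about any further processing inside mpi4py.

```python
import pickle
import struct

values = [float(i) for i in range(1000)]

# Roughly what happens for the lowercase methods: serialize, move bytes, deserialize
payload = pickle.dumps(values)
restored = pickle.loads(payload)

# A raw buffer holds nothing but the data itself: 8 bytes per float64 value
raw = struct.pack("{}d".format(len(values)), *values)

print("pickled: {} bytes, raw: {} bytes".format(len(payload), len(raw)))
print("round trip preserved the data:", restored == values)
```

Besides the larger payload, pickling and unpickling also cost CPU time on both sides, which the timing comparison later in this notebook makes visible.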

Often we simply want to send a raw data array using the MPI API. For this purpose, we can use the methods which start with an uppercase letter.

In [13]:
if rank == 0:
    c = np.array( [42], dtype=np.float64 )
    comm.Send(c, dest=1, tag=1002)
    
if rank == 1:
    c = np.empty( (1,), dtype=np.float64 )
    comm.Recv(c, source=0, tag=1002)
    print(c)

[stdout:4] [42.]


If we are sending around raw data buffers, we need to pre-allocate the storage on the receiver. You can see that the receive method no longer returns a value but requires an argument into which the received data is written. As a consequence, the receiver has to know in advance the type and amount of data that it will receive from the sender. If there is a mismatch in the data type or the number of data elements, your program will crash and may be in an undefined state.

In [14]:
if rank == 0:
    c = np.array( [42, 43], dtype=np.float64 )
    comm.Send(c, dest=1, tag=1002)
    
if rank == 1:
    c = np.empty( (1,), dtype=np.float64 )
    comm.Recv(c, source=0, tag=1002)
    print(c)

CompositeError: one or more exceptions from call to method: execute
[4:execute]: Exception: Message truncated, error stack:
MPI_Recv(212)...........................: MPI_Recv(buf=0x14ca670, count=1, MPI_DOUBLE, src=0, tag=1002, MPI_COMM_WORLD, status=0x1) failed
MPIDI_CH3_PktHandler_EagerShortSend(376): Message from rank 0 and tag 1002 truncated; 16 bytes received but buffer size is 8

You can also get silent errors, which are very hard to debug. The MPI library is only really concerned with the size of the send and receive buffer. The necessary and sufficient condition for a send/recv operation to complete successfully, is that the receive buffer is at least as large in number of bytes as the send buffer.

Try changing the sending side to send `float32` in the above example. Now the sender will send 2 values of type `float32`, which corresponds to a total of 8 bytes. The receiver is expecting one value of type `float64`, which also corresponds to a total of 8 bytes. The bytes are transferred correctly; it is just that the interpretation of the data on the receiving side is incorrect.
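
The same reinterpretation can be reproduced without MPI using the standard `struct` module. This is a sketch of what happens to the bytes, assuming little-endian byte order on both sides; it is not what mpi4py executes internally.

```python
import struct

# Sender side: two float32 values, 4 bytes each
payload = struct.pack("<2f", 42.0, 43.0)

# Receiver side: the same 8 bytes interpreted as a single float64
(value,) = struct.unpack("<d", payload)

print("bytes transferred:", len(payload))
print("value seen by the receiver:", value)

# Interpreting the bytes with the correct type recovers the original data
print("correct interpretation:", struct.unpack("<2f", payload))
```

No error is raised anywhere, because the sizes match. The receiver simply ends up with a number that has nothing to do with 42 or 43.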

Change the type back to `float64` and try increasing the number of elements in the receive buffer from 1 to 4 and rerun the example above. You can see that as long as the number of elements received is smaller than or equal to the buffer size, MPI does not issue an error. While the first two values `c[0]`, `c[1]` contain the correct 42 and 43, the values `c[2]`, `c[3]` are undefined, because `c` has been allocated using `np.empty()`, which does not initialize the values of `c`.

The main advantage of using the array data API of the `mpi4py` is speed. In the example below we are transferring a numpy data array of considerable size multiple times using the array data API and the pickle/unpickle API of mpi4py. You can see that there is a considerable time difference and if speed is of concern, the array interface is preferable, although it provides less safety.

In [15]:
num_bytes = 1024 * 1024
num_iter = 2048
c = np.random.rand(num_bytes // 8)

if rank == 0:
    tic = timeit.default_timer()
    for iter in range(num_iter):
        comm.Send(c, dest=1, tag=1003)
    toc = timeit.default_timer()
    print("Rank 0 spent {:.4f}s sending {} GB using array data".format(toc-tic, num_bytes * num_iter / 1024**3))

    tic = timeit.default_timer()
    for iter in range(num_iter):
        comm.send(c, dest=1, tag=1004)
    toc = timeit.default_timer()
    print("Rank 0 spent {:.4f}s sending {} GB using pickle/unpickle".format(toc-tic, num_bytes * num_iter / 1024**3))

if rank == 1:
    tic = timeit.default_timer()
    for iter in range(num_iter):
        comm.Recv(c, source=0, tag=1003)
    toc = timeit.default_timer()
    print("Rank 1 spent {:.4f}s receiving {} GB using array data".format(toc-tic, num_bytes * num_iter / 1024**3))

    tic = timeit.default_timer()
    for iter in range(num_iter):
        c = comm.recv(source=0, tag=1004)
    toc = timeit.default_timer()
    print("Rank 1 spent {:.4f}s receiving {} GB using pickle/unpickle".format(toc-tic, num_bytes * num_iter / 1024**3))

[stdout:4] 
Rank 1 spent 0.1666s receiving 2.0 GB using array data
Rank 1 spent 1.8896s receiving 2.0 GB using pickle/unpickle
[stdout:5] 
Rank 0 spent 0.1665s sending 2.0 GB using array data
Rank 0 spent 1.8896s sending 2.0 GB using pickle/unpickle


<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>3.</b> Compute the communication bandwidth from the above values for array data and pickle/unpickle communication. We are much closer to the values we would expect from a modern HPC system (see above). Can you explain why this is the case for bandwidth and not latency?<br>
</div>

*Solution:*

The bandwidth is 2.0 GB / 0.1665 s = 12.0 GB/s for the array data communication API of mpi4py.

The bandwidth is 2.0 GB / 1.8896 s = 1.06 GB/s for the pickle/unpickle communication API of mpi4py.

The reason why bandwidth is much closer to system specs as compared to latency, is that for bandwidth the overhead that Python imposes only happens once, whereas the overhead of Python for the latency test happened for every iteration.
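
The arithmetic can be written out as follows. The timings are the ones printed by rank 0 in the output above, and the latency used for comparison is roughly the value from the ping-pong exercise; the variable names are chosen for illustration only.

```python
num_bytes = 1024 * 1024   # size of one message in bytes
num_iter = 2048           # number of messages sent
t_array = 0.1665          # seconds, array data API (from the output above)
t_pickle = 1.8896         # seconds, pickle/unpickle API (from the output above)

total_gb = num_bytes * num_iter / 1024**3
bandwidth_array = total_gb / t_array
bandwidth_pickle = total_gb / t_pickle
print("array data:       {:.2f} GB/s".format(bandwidth_array))
print("pickle/unpickle:  {:.2f} GB/s".format(bandwidth_pickle))

# Time spent per 1 MB message compared with the latency measured earlier
time_per_message_us = t_array / num_iter * 1.0e6
latency_us = 3.4  # roughly the value measured in the ping-pong exercise
print("time per message: {:.1f} µs, of which latency is about {:.0f}%".format(
    time_per_message_us, 100.0 * latency_us / time_per_message_us))
```

For messages of this size, the fixed per-message cost is a small fraction of the total transfer time, which is consistent with the explanation above.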

## Deadlock

Deadlock is a situation where an MPI rank (worker) is waiting for an MPI operation to complete which never does because of an error in the program logic. Such errors can be hard to debug, and careful design and checking of the user code may be more efficient than trial and error when parallelizing a sequential code with MPI.

A classical example is if a rank is trying to receive a message that has never been sent. Try commenting out the `Send()` after having run the example below for a first time. If there is no `Send()` the `Recv()` will simply hang and wait for a message to arrive.

You have to choose the `Kernel` &rarr; `Interrupt Kernel` menu option in order to stop the kernel. Unfortunately, the state of the workers is undefined after such an error (see [Restarting the IPyParallel Cluster](#restarting) at the end of this notebook).

In [None]:
a = np.array([42.], dtype=np.float64)

if rank == 0:
    print("Sending message on rank 0")
    comm.Send(a, dest=1)

if rank == 1:
    print("Receiving message on rank 1")
    comm.Recv(a, source=0)
    print(a)

Another classical way to produce a deadlock is the situation when two ranks want to exchange a piece of information (data from a numpy array in our case). Do you understand why the code below deadlocks?

**Warning: you will have to abort the kernel and possibly restart the ipcluster once more, sorry!**

The reason for the deadlock is that MPI internally often uses a handshake protocol. Normally, in a deadlock-free situation, the conversation between rank 0 and rank 1 proceeds in the following way:

- Rank 0: Hey, I would like to send you 128 MB of data (Request to send, RTS)
- Rank 1: Ok, I have a matching receive (Clear to send, CTS)
- Rank 0: I'm sending you the data (RDMA)
- Rank 0: Done, the data is in your memory (Finished transmission, FIN)

This is called the *rendezvous protocol* or handshake protocol. Now in the situation below both ranks request to send data and wait for the other rank.

An easy way to fix the program in the cell below is to switch the order of the `Send()` and `Recv()` around on rank 1. Try it!

In [None]:
num_elements = 16 * 1024 * 1024
a = np.random.rand(num_elements)
b = np.empty(num_elements, dtype=np.float64)

if rank == 0:
    comm.Send(a, dest=1, tag=100)
    comm.Recv(b, source=1, tag=101)
    print('a has been received in b on rank 0')

if rank == 1:
    comm.Send(a, dest=0, tag=101)
    comm.Recv(b, source=0, tag=100)
    print('a has been received in b on rank 1')
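
The effect of the ordering can be explored safely with a small serial simulation. It is a toy model that assumes a blocking send completes only once the peer is sitting in the matching receive (no buffering, as in the handshake described above); the function `run_rendezvous` is a hypothetical helper and not an MPI call.

```python
def run_rendezvous(programs):
    """Simulate blocking, unbuffered point-to-point communication.

    programs maps each rank to a list of ("send", peer) or ("recv", peer) steps.
    Returns True if every rank finishes, False if the ranks deadlock.
    """
    pc = {rank: 0 for rank in programs}  # index of each rank's current step
    progress = True
    while progress:
        progress = False
        for rank, steps in programs.items():
            if pc[rank] >= len(steps):
                continue  # this rank has already finished
            op, peer = steps[pc[rank]]
            if op != "send" or pc[peer] >= len(programs[peer]):
                continue
            # The send completes only if the peer currently waits in the matching receive
            if programs[peer][pc[peer]] == ("recv", rank):
                pc[rank] += 1
                pc[peer] += 1
                progress = True
    return all(pc[rank] == len(steps) for rank, steps in programs.items())


both_send_first = {0: [("send", 1), ("recv", 1)],
                   1: [("send", 0), ("recv", 0)]}
rank1_receives_first = {0: [("send", 1), ("recv", 1)],
                        1: [("recv", 0), ("send", 0)]}

print("both send first completes:     ", run_rendezvous(both_send_first))
print("rank 1 receives first completes:", run_rendezvous(rank1_receives_first))
```

In this model, the first program never makes progress because neither send finds a matching receive, while swapping the two operations on rank 1 lets both ranks finish.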

Since this is such a common situation, a special MPI API method called `Sendrecv()` is provided exactly for this use case.

In [16]:
num_elements = 16 * 1024 * 1024
a = np.random.rand(num_elements)
b = np.empty(num_elements, dtype=np.float64)

if rank == 0:
    comm.Sendrecv(sendbuf=a, dest=1, sendtag=100, recvbuf=b, source=1, recvtag=101)
    print('a has been received in b on rank 0')

if rank == 1:
    comm.Sendrecv(sendbuf=a, dest=0, sendtag=101, recvbuf=b, source=0, recvtag=100)
    print('a has been received in b on rank 1')

[stdout:4] a has been received in b on rank 1
[stdout:5] a has been received in b on rank 0


## Non-blocking Communication

The send and receive methods introduced above are *blocking*, in the sense that they do not return until the communication has been executed. On many systems, performance can be significantly increased by overlapping communication and computation. This is particularly true on systems where communication is executed autonomously by an intelligent, dedicated communication controller, such as modern supercomputers.

For this purpose, MPI provides *non-blocking* methods. The general pattern for a non-blocking operation is `req = comm.Isomething()`, which initiates the communication, followed later in the code by a `req.wait()`, which waits until the communication operation has completed (if it has not already done so). Any computation placed between the two calls can overlap with the communication.
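
The shape of this pattern can be illustrated without MPI by using a background thread as a stand-in for the communication. This is only an analogy: `slow_transfer` is a made-up helper, and the comments indicate which mpi4py call each line corresponds to.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_transfer(data):
    """Stand-in for a communication that takes a while (hypothetical helper)."""
    time.sleep(0.2)
    return list(data)


with ThreadPoolExecutor(max_workers=1) as pool:
    tic = time.perf_counter()
    # Start the operation; returns immediately, like req = comm.Isend(...)
    request = pool.submit(slow_transfer, [1, 2, 3])
    # Useful work that does not touch the data in flight
    local_result = sum(i * i for i in range(100000))
    # Block until the operation has completed, like req.wait()
    received = request.result()
    elapsed = time.perf_counter() - tic

print("received {} and computed {} in {:.2f}s".format(received, local_result, elapsed))
```

With MPI the same structure applies: start the operation, compute on data that is not involved in the transfer, and only then wait. The buffer handed to a non-blocking send must not be modified before the wait has returned.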

Let's revisit the deadlock problem above. We can retain the same order of send/receive operations on all ranks but use a non-blocking send instead.

In [17]:
num_elements = 16 * 1024 * 1024
a = np.random.rand(num_elements)
b = np.empty(num_elements, dtype=np.float64)

if rank == 0:
    req = comm.Isend(a, dest=1, tag=100)
    comm.Recv(b, source=1, tag=101)
    req.wait()
    print('a has been received in b on rank 0')

if rank == 1:
    req = comm.Isend(a, dest=0, tag=101)
    comm.Recv(b, source=0, tag=100)
    req.wait()
    print('a has been received in b on rank 1')

[stdout:4] a has been received in b on rank 1
[stdout:5] a has been received in b on rank 0


<div class="alert alert-block alert-info">
<b>Now it's your turn...</b><br>
<b>4.</b> Write a program in which the ranks are arranged in a ring (0 &rarr; 1, 1 &rarr; 2, ... , 7 &rarr; 0). Send the rank number of each rank around the ring until it gets back to the sender.<br>
<b>5.</b> Assuming that you have used non-blocking communication in your implementation, can you think of a solution with blocking communication?<br>
</div>

In [18]:
# TODO Ring

c = None
message_counter=0

while c != rank:

    if c is None:
        c = rank

    next_rank = (rank + 1) % size
    req = comm.isend(c, dest=next_rank, tag=1000 + message_counter)
    
    previous_rank = (rank - 1) % size
    d = comm.recv(source=previous_rank, tag=1000 + message_counter)
    
    req.wait()

    message_counter += 1
    c = d

print("Rank {} sent {} messages and has c = {}".format(rank, message_counter, c))

[stdout:0] Rank 4 sent 8 messages and has c = 4
[stdout:1] Rank 3 sent 8 messages and has c = 3
[stdout:2] Rank 7 sent 8 messages and has c = 7
[stdout:3] Rank 6 sent 8 messages and has c = 6
[stdout:4] Rank 1 sent 8 messages and has c = 1
[stdout:5] Rank 0 sent 8 messages and has c = 0
[stdout:6] Rank 5 sent 8 messages and has c = 5
[stdout:7] Rank 2 sent 8 messages and has c = 2
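
Why every rank sends exactly `size` messages can be checked with a serial simulation of the ring, in which a list index plays the role of the rank. This is a sketch for building intuition and does not use MPI; `shift_ring` is a hypothetical helper.

```python
def shift_ring(values):
    """One communication step: rank r receives what rank r - 1 was holding."""
    size = len(values)
    return [values[(rank - 1) % size] for rank in range(size)]


size = 8
start = list(range(size))   # every rank starts with its own rank number
held = list(start)
num_steps = 0

while True:
    held = shift_ring(held)
    num_steps += 1
    if held == start:       # every value is back at its sender
        break

print("values are back at their senders after {} steps".format(num_steps))
```

Each step moves every value one position along the ring, so it takes as many steps as there are ranks for a value to return to its origin.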


## Synchronisation

In a distributed memory system, individual workers progress at their own speed and may be at completely different places in the program execution. A simple example is given below, where ranks 0, 1, and 2 reach the same points in the program up to 2 s apart.

In [19]:
t1 = timeit.default_timer()

if rank == 0:
    time.sleep(1)

# Point 2
t2 = timeit.default_timer()

if rank == 1:
    time.sleep(2)

# Point 3
t3 = timeit.default_timer()

if rank in [0, 1, 2]:
    print("Rank {} reached Point 2 after {:.5f}s and Point 3 after {:.5f}s".format(rank, t2-t1, t3-t1))

[stdout:4] Rank 1 reached Point 2 after 0.00009s and Point 3 after 2.00235s
[stdout:5] Rank 0 reached Point 2 after 1.00115s and Point 3 after 1.00123s
[stdout:7] Rank 2 reached Point 2 after 0.00009s and Point 3 after 0.00016s


In analogy to the OpenMP `!$omp barrier` directive for synchronizing threads, MPI provides the `comm.Barrier()` method which synchronizes all of the ranks (workers). The "barrier" opens only once all of the ranks have reached the barrier. Insert a barrier before the timer at Point 2 and another barrier before the timer at Point 3. Rerun and compare the results.

Note that barriers can also very easily lead to deadlock situations. If a `comm.Barrier()` is put inside an `if`-statement, it can happen that some ranks never reach the barrier and MPI will wait indefinitely. If you insert a `comm.Barrier()` in the `if rank == 1` body, you will experience yet another deadlock and will have to restart the IPyParallel cluster.
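
The semantics of a barrier can be demonstrated with threads from the Python standard library. This is an analogy only: `threading.Barrier` stands in for `comm.Barrier()`, threads stand in for ranks, and the variable names are chosen for illustration.

```python
import threading
import time

num_workers = 4
barrier = threading.Barrier(num_workers)
lock = threading.Lock()
state = {"arrived": 0}                  # how many workers have reached the barrier
seen_after_barrier = [None] * num_workers


def worker(rank):
    time.sleep(0.05 * rank)             # workers reach the barrier at different times
    with lock:
        state["arrived"] += 1
    barrier.wait()                      # plays the role of comm.Barrier()
    seen_after_barrier[rank] = state["arrived"]


threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(num_workers)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(seen_after_barrier)
```

Every worker observes that all workers have arrived by the time it leaves the barrier, no matter how early it got there. If one worker never called `barrier.wait()`, the others would wait forever, which mirrors the deadlock described above.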

## Collective Communication

Collective communications are communication patterns that involve all ranks in a communicator. If `MPI.COMM_WORLD` is used as the communicator, collective communication involves all available ranks. In principle, collective communication is only a convenience, since all possible communication patterns can be implemented using point-to-point communication. But programs are much easier to read because the communication patterns are expressed at a higher level and in a consistent manner.

Common collective communication patterns are one-to-all (*broadcast*, *scatter*), all-to-one (*gather*) and all-to-all (*allgather*, *alltoall*). We will only cover the most basic variants here. 

#### Broadcast (same data on all ranks)

A typical use case is that a configuration file is read on the root rank and then the configuration is distributed to all other ranks. For example, in weather and climate models, the namelist parameters are often only read on the root rank and then broadcast to all the other ranks.

In [20]:
if rank == 0:
    data = {'alpha'  : 0.01,
            'active' : True}
else:
    data = None
data = comm.bcast(data, root=0)
print(data)

[stdout:0] {'alpha': 0.01, 'active': True}
[stdout:1] {'alpha': 0.01, 'active': True}
[stdout:2] {'alpha': 0.01, 'active': True}
[stdout:3] {'alpha': 0.01, 'active': True}
[stdout:4] {'alpha': 0.01, 'active': True}
[stdout:5] {'alpha': 0.01, 'active': True}
[stdout:6] {'alpha': 0.01, 'active': True}
[stdout:7] {'alpha': 0.01, 'active': True}


#### Scatter (distribute data to ranks)

Scatter takes a data array on a given root rank and distributes it across the ranks in equally sized chunks. A typical use case for a scatter operation is when data is read from disk and then distributed to the ranks in order to work on the data in parallel. For weather and climate models, very often the initial condition is read as entire model levels (often called *global fields*) which are then scattered to the different subdomains on the different ranks according to the domain decomposition in the horizontal. (This is further discussed in the next notebook.)

In [21]:
num_elements = 6 * size

global_a = None
if rank == 0:
    global_a = np.linspace(0., num_elements - 1., num_elements)
    
print("Rank {} has global_a = {}".format(rank, global_a))

a = np.empty(num_elements // size, dtype=np.float64)

comm.Scatter(global_a, a, root=0)

print("Rank {} has a = {}".format(rank, a))

[stdout:0] 
Rank 4 has global_a = None
Rank 4 has a = [24. 25. 26. 27. 28. 29.]
[stdout:1] 
Rank 3 has global_a = None
Rank 3 has a = [18. 19. 20. 21. 22. 23.]
[stdout:2] 
Rank 7 has global_a = None
Rank 7 has a = [42. 43. 44. 45. 46. 47.]
[stdout:3] 
Rank 6 has global_a = None
Rank 6 has a = [36. 37. 38. 39. 40. 41.]
[stdout:4] 
Rank 1 has global_a = None
Rank 1 has a = [ 6.  7.  8.  9. 10. 11.]
[stdout:5] 
Rank 0 has global_a = [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47.]
Rank 0 has a = [0. 1. 2. 3. 4. 5.]
[stdout:6] 
Rank 5 has global_a = None
Rank 5 has a = [30. 31. 32. 33. 34. 35.]
[stdout:7] 
Rank 2 has global_a = None
Rank 2 has a = [12. 13. 14. 15. 16. 17.]
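
Which elements end up on which rank follows from simple index arithmetic. The serial sketch below reproduces the chunking seen in the output above using plain Python lists, where the list index plays the role of the rank; `scatter_chunks` and `gather_chunks` are hypothetical helpers and not MPI calls.

```python
def scatter_chunks(data, size):
    """Split data into `size` equally sized chunks; chunk r belongs to rank r."""
    if len(data) % size != 0:
        raise ValueError("data length must be divisible by the number of ranks")
    chunk = len(data) // size
    return [data[rank * chunk:(rank + 1) * chunk] for rank in range(size)]


def gather_chunks(chunks):
    """Concatenate the chunks in rank order (the inverse of scatter_chunks)."""
    return [value for chunk in chunks for value in chunk]


size = 8
global_a = [float(i) for i in range(6 * size)]

chunks = scatter_chunks(global_a, size)
print("rank 4 would hold", chunks[4])

# Putting the chunks back together in rank order restores the original array
print("round trip ok:", gather_chunks(chunks) == global_a)
```

Rank `r` receives the elements with indices from `r * chunk` up to, but not including, `(r + 1) * chunk`, which matches the values printed by rank 4 above.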


#### Gather (assemble data from ranks)


Gather is the inverse operation of scatter. It assembles equally sized chunks from the ranks back into a single data array on a specified root rank.

In [22]:
global_b = None
if rank == 0:
    global_b = np.empty_like(global_a)

comm.Gather(a, global_b, root=0)

if rank == 0:
    print("Rank {} has global_b = {}".format(rank, global_b))
    if np.all(global_a == global_b):
        print("Everything assembled back together on rank 0")

[stdout:5] 
Rank 0 has global_b = [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35.
 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47.]
Everything assembled back together on rank 0


<a id='restarting'></a>
## Restarting the IPyParallel Cluster

Sometimes the state of the workers is undefined, for example after a deadlock and having to interrupt the kernel. When this happens, it is best to exit `%autopx` and stop the IPyParallel cluster using `%ipcluster stop`. Then one can restart the kernel and start executing the notebook from the beginning again. For convenience, you can simply execute the 4 cells below and then go back to where you were working before.

If this does not help either, you have to restart your JupyterHub server via `File` &rarr; `Hub Control Panel` &rarr; `Stop Server` and start over again.

In [23]:
%autopx

%autopx disabled


In [24]:
%ipcluster stop

In [None]:
import numpy as np
import ipcmagic
import ipyparallel as ipp
%ipcluster start -n 8 --mpi
rc = ipp.Client()
rc.ids
dv = rc[:]
dv.activate()
dv.block = True
print("Running IPython Parallel on {0} MPI engines".format(len(rc.ids)))
print("Commands in the following cells will be executed on the workers in parallel (disable with %autopx)")
%autopx

In [None]:
import time
import timeit
import numpy as np
import matplotlib.pyplot as plt 
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()