
# Learning Objectives

We will learn about the following in this lab:

* Point-to-point and collective MPI communication routines.
* Managing the two-level hierarchy created by global and local rank of a process and how it accesses GPU(s).
* OpenMPI process mappings and their effect on application performance.

**Note:** Execution results can vary significantly based on the MPI installation, supporting libraries, workload manager, and underlying CPU and GPU hardware configuration and topology. The codes in this lab have been tested on DGX-1 nodes with 8 Tesla V100 16 GB GPUs each, connected by Mellanox InfiniBand NICs and running OpenMPI v4.1.1 with HPCX 2.8.1 and CUDA v11.3.0.0, as well as on DGX nodes with 8 Ampere A100 80 GB GPUs each (OpenMPI v4.1.1, HPC SDK 22.7, HPCX 2.11).

## MPI Inter-Process Communication

Let us learn more about how MPI communicates between processes.

### Point-to-Point communication

Two MPI processes can communicate directly (point-to-point) by sending and receiving data packets to and from each other. Both the sender and receiver processes must acknowledge the transaction by calling the `MPI_Send` and `MPI_Recv` functions, respectively. MPI also allows tagging messages to differentiate between the various messages that processes may send to each other.

The function syntax for `MPI_Send` is:

```c
int MPI_Send(const void* data, int count, MPI_Datatype datatype, int destination,
         int tag, MPI_Comm communicator);
```

Similarly, the syntax for `MPI_Recv` is:

```c
int MPI_Recv(void* data, int count, MPI_Datatype datatype, int source, int tag,
         MPI_Comm communicator, MPI_Status* status);
```
   
A simple 2-process send-receive code is as follows:

```c
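int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // Global rank of this process

int data;
if (rank == 0) {
    // Rank 0 sends one int to rank 1 with tag 0
    data = -1;
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    // Rank 1 receives one int from rank 0 with the matching tag 0
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}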
```

There are several other functions to send and receive data synchronously and asynchronously. In particular, we will make use of the `MPI_Sendrecv` function, which sends and receives a message in a single call, and whose syntax is as follows:

```c
int MPI_Sendrecv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                int dest, int sendtag,
                void *recvbuf, int recvcount, MPI_Datatype recvtype,
                int source, int recvtag,
                MPI_Comm comm, MPI_Status *status);
```

### Collective communication

Collective communication involves the participation of all processes in a communicator. It implies a synchronization point among the processes. Depending on the requirement, we can perform broadcast, scatter, gather, reduce, and other operations between the participating processes.

In our application, we would like to reduce all the rank-local norms to a single global norm using the sum operation. We use the `MPI_Allreduce` function for this, which combines values from all processes and distributes the result back to all processes. Its syntax is as follows:

```c
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
```

The `op` in our case will be `MPI_SUM`.
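To make the semantics concrete, here is a minimal serial sketch in plain C of what `MPI_Allreduce` with `MPI_SUM` computes for one value per rank. It does not call MPI; the function name `allreduce_sum_model` and the one-array-entry-per-rank layout are illustrative assumptions, not part of the lab code.

```c
#include <assert.h>

/* Serial stand-in for MPI_Allreduce(..., MPI_SUM, ...) on one float per rank:
   every entry of global[] receives the sum of all local[] entries. */
void allreduce_sum_model(const float *local, float *global, int num_ranks) {
    assert(num_ranks > 0);
    float sum = 0.0f;
    /* "Reduce" step: combine the contribution of every rank. */
    for (int r = 0; r < num_ranks; ++r) {
        sum += local[r];
    }
    /* "All" step: every rank ends up with the same reduced value. */
    for (int r = 0; r < num_ranks; ++r) {
        global[r] = sum;
    }
}
```

For example, with local values 1, 2, 3, and 4 on four ranks, every entry of `global` ends up holding 10, mirroring how each rank receives the same reduced result.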

## Communication Models

We will use multiple ranks within our program as we will use multiple nodes. There are three major approaches to handle GPUs within a node:

1. Single GPU per rank
  * One process controls one GPU.
  * Easier to program and understand.
  * We can re-use our domain decomposition approach.


2. Multiple GPUs per rank
  * Usually, all GPUs within a node are handled by one process.
  * Coordinating between GPUs is quite tricky as CUDA-based communication is intertwined with MPI communication.
  * Requires a new decomposition for the two-tier communication hierarchy (MPI and CUDA).


3. Single GPU per multiple ranks
  * Multiple processes use the same GPU and number of processes in a node is usually equal to number of cores.
  * Intended for heterogeneous codes where both CPU and GPU accelerate the application.
  * CUDA Multi-Process-Service (MPS) is required to allow multiple CUDA processes to share a single GPU context.
  
We will take the first approach due to its simplicity (which eliminates approach #2) and because our application doesn't utilize CPU for compute (which eliminates approach #3). Thus our rank (core) to GPU mapping is one-to-one, as follows:

![mpi_overview](../../images/mpi_overview.png)

### Node-Level Local Rank

As we will run on multiple nodes, for example 2 nodes, the number of processes launched will be 16 (assuming 8 GPUs per node, as in a DGX). This requires an additional mapping of the process ID to the GPU device ID, which runs from 0 to 7 on each node. Thus, we need to create a local rank at the node level.

To achieve this, we split the `MPI_COMM_WORLD` communicator between the nodes and store it in a `local_comm` communicator. Then, we get the local rank by calling the familiar `MPI_Comm_rank` function. Finally, we free the `local_comm` communicator as we don't require it anymore. 

The code snippet to obtain the `local_rank` at each node level is as follows:

```c
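int local_rank = -1;
MPI_Comm local_comm;
// Group the ranks that share a node (shared-memory domain) into local_comm;
// the global rank is passed as the key, so local ranks follow global rank order
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &local_comm);
// Rank of this process within its node-level communicator
MPI_Comm_rank(local_comm, &local_rank);
// local_comm is only needed to compute local_rank
MPI_Comm_free(&local_comm);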
```
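Once `local_rank` is known, it can be turned into a device index. The sketch below is plain C without MPI or CUDA calls; the helper `device_for_local_rank` and the modulo wrap-around are assumptions for illustration, and the lab source may select the device differently.

```c
#include <assert.h>

/* Hypothetical helper: pick a device index for a node-local rank.
   The modulo keeps the index valid if a node runs more ranks than GPUs. */
int device_for_local_rank(int local_rank, int num_devices) {
    assert(local_rank >= 0 && num_devices > 0);
    return local_rank % num_devices;
}
```

With 8 GPUs per node, local ranks 0 to 7 map to devices 0 to 7. The returned index would typically be passed to `cudaSetDevice`.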

## Implementation Exercise: Part 1

### Code Structure

Open the [jacobi_memcpy_mpi.cpp](../../source_code/mpi/jacobi_memcpy_mpi.cpp) and [jacobi_kernels.cu](../../source_code/mpi/jacobi_kernels.cu) files. Alternatively, you can navigate to the `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `jacobi_memcpy_mpi.cpp` and `jacobi_kernels.cu` files.

We separate the device kernels from other CUDA and MPI functions as the `nvc++` compiler is required to compile CUDA C++, and it may not be installed on some platforms. Note that NVIDIA's HPC SDK includes the `nvc++` compiler.

Review the [Makefile](../../source_code/mpi/Makefile) to see that we compile the CUDA kernels using `nvcc` and link the object file with `jacobi_memcpy_mpi.cpp` using the `mpicxx` compiler as follows:

```bash
# Compiling jacobi_kernels.cu
nvcc -gencode arch=compute_80,code=sm_80 -std=c++14 jacobi_kernels.cu -c
# Compiling and linking with jacobi_memcpy_mpi.cpp
mpicxx -I${CUDA_HOME}/include -fopenmp -std=c++14 jacobi_memcpy_mpi.cpp jacobi_kernels.o \
        -L${CUDA_HOME}/lib64 -lcudart -lnvToolsExt -o jacobi_memcpy_mpi
```

The device kernels are the same as in previous labs. Open the `jacobi_memcpy_mpi.cpp` file and understand the flow of the program. In particular, observe the following:

1. `local_rank` is used to set the current GPU device.
2. Device kernel calls have been replaced with function wrappers for ease of compilation.
3. Rank 0 is used to calculate efficiency and other metrics, even though all ranks run the `single_gpu` function to verify the correctness of the multi-GPU implementation.
4. In the first set of halo exchanges, `top_halo_buf` stores the top halo copied from the device to the host, which is then sent to the top neighbour, whereas `bot_halo_buf` stores the updated bottom halo received from the bottom neighbour, which is then copied from the host to the device.
5. In the second set of halo exchanges, `top_halo_buf` stores the updated top halo received from the top neighbour, which is then copied from the host to the device, whereas `bot_halo_buf` stores the bottom halo copied from the device to the host, which is then sent to the bottom neighbour.
6. Each halo exchange is wrapped in an NVTX "Halo exchange Memcpy+MPI" range for ease of viewing in the profiler.
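
The halo exchanges described above need the ranks of the top and bottom neighbours. Here is a minimal sketch in plain C, assuming a 1D decomposition with periodic (wrap-around) boundaries. The helper names are hypothetical and the boundary handling in the lab source may differ.

```c
#include <assert.h>

/* Rank that owns the rows directly above this rank's rows (wraps around). */
int top_neighbour(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + size - 1) % size;
}

/* Rank that owns the rows directly below this rank's rows (wraps around). */
int bottom_neighbour(int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    return (rank + 1) % size;
}
```

These values would serve as the `dest` and `source` arguments of `MPI_Sendrecv`: in the first exchange a rank sends to its top neighbour and receives from its bottom neighbour, and in the second exchange the roles are swapped.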

### To-Do

Now, implement the following marked as `TODO: Part 1-`:

* Obtain the node-level local rank by splitting the global communicator.
* Implement the MPI portion of the first set of halo exchanges using `MPI_Sendrecv` as explained above.
* Implement the Memcpy operations and MPI calls for the second set of halo exchanges. Recall why `cudaMemcpyAsync` is not the correct way of implementing this MPI program.
* Reduce the rank-local L2 Norm to a global L2 norm using `MPI_Allreduce` function.

After implementing these, compile the program:

In [None]:
!cd ../../source_code/mpi && make clean && make jacobi_memcpy_mpi

Ensure there are no compilation errors. Now, let us validate the program. 

The grid-size of 16384$\times$16384 will fully utilize all 8 GPUs. To test with 16 GPUs, we can increase the grid size to 16384$\times$32768 to maintain the invariant that GPUs are not under-utilized. Note that the halo exchange copy size remains the same as before (16K elements * size of float (4B) = 64KB).
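
The sizing argument above can be checked with a little arithmetic. The sketch below is plain C and assumes the grid is split evenly by rows (along `ny`) across GPUs and that each element is a 4-byte `float`; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rows of the grid owned by each GPU when ny rows are split evenly. */
int rows_per_gpu(int ny, int num_gpus) {
    assert(num_gpus > 0 && ny % num_gpus == 0);
    return ny / num_gpus;
}

/* Bytes in one halo row of nx single-precision values. */
size_t halo_bytes(int nx) {
    assert(nx > 0);
    return (size_t)nx * sizeof(float);
}
```

Under these assumptions, both the 16384$\times$16384 grid on 8 GPUs and the 16384$\times$32768 grid on 16 GPUs give each GPU 2048 rows, and one halo row of 16384 floats occupies 65536 bytes (64 KB).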

**Due to limited resources, we will use a smaller grid ($2K\times4K$) on 2 GPUs.**

- **To run on half a node with 4 GPUs with $4K\times8K$ grid size, use: `mpirun -np 4 --map-by ppr:4:socket ./jacobi_memcpy_mpi -nx 4096 -ny 8192`.**
- **To run on a full node with 8 GPUs with $16K\times16K$ grid size, use: `mpirun -np 8 --map-by ppr:4:socket ./jacobi_memcpy_mpi -ny 16384`.**
- **To run across 2 nodes with 16 GPUs with $16K\times32K$ grid size, use: `mpirun -np 16 --map-by ppr:4:socket ./jacobi_memcpy_mpi -ny 32768` inside a batch script or simply run `srun --partition=gpu  --nodes=2 --gres=gpu:8  --ntasks=16 --ntasks-per-node=8 --mpi=pmix --ntasks-per-socket=4 ./jacobi_memcpy_mpi -ny 32768` on the command line.** ***<mark>NOTE: If resources are available</mark>***

Run the program with 2 processes:

In [None]:
!cd ../../source_code/mpi && srun --partition=gpu -n1 --gres=gpu:2  --ntasks=2 --ntasks-per-node=2 --mpi=pmix --ntasks-per-socket=2 ./jacobi_memcpy_mpi -nx 2048 -ny 4096

### DGX system with 8 Ampere A100
Partial results obtained from a DGX system with 8 A100s:

Using 2 GPUs (2K$\times$4K grid size)
```bash
Num GPUs: 2.
2048x4096: 1 GPU:   0.1366 s, 2 GPUs:   0.1258 s, speedup:     1.09, efficiency:    54.28 
```

Using 4 GPUs, half a node (4K$\times$8K grid size)
```bash
Num GPUs: 4.
4096x8192: 1 GPU:   0.4438 s, 4 GPUs:   0.1889 s, speedup:     2.35, efficiency:    58.74 
```

Using 8 GPUs, full node (16384$\times$16384 grid size)
```bash
Num GPUs: 8.
16384x16384: 1 GPU:   3.3022 s, 8 GPUs:   0.6327 s, speedup:     5.22, efficiency:    65.24 
```

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size)
```bash
Num GPUs: 16.
16384x32768: 1 GPU:   6.5526 s, 16 GPUs:   0.6500 s, speedup:    10.08, efficiency:    63.01 
```

### DGX system with 8 Tesla V100
Partial results obtained from a DGX system with 8 Tesla V100s:

Using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
Num GPUs: 16.
16384x32768: 1 GPU:   8.9057 s, 16 GPUs:   0.7695 s, speedup:    11.57, efficiency:    72.34 
```

Using 4 nodes connected by InfiniBand (IB) NICs (16K$\times$64K grid size, $4\times$ the single-node's grid size):
```bash
Num GPUs: 32.
16384x65536: 1 GPU:  17.6316 s, 32 GPUs:   0.8526 s, speedup:    20.68, efficiency:    64.62
```

As the communication overhead increases due to more inter-node communication, the speedup obtained and thus the efficiency of the application decreases. Nonetheless, our program can scale across multiple nodes.
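
The speedup and efficiency columns in the outputs above are consistent with the usual definitions: speedup is the single-GPU time divided by the multi-GPU time, and efficiency is the speedup divided by the GPU count, in percent. The sketch below is plain C; the function names are hypothetical and the formulas are inferred from the printed numbers rather than taken from the lab source.

```c
#include <assert.h>

/* Speedup of an N-GPU run relative to the single-GPU run. */
double speedup(double t_single, double t_multi) {
    assert(t_single > 0.0 && t_multi > 0.0);
    return t_single / t_multi;
}

/* Parallel efficiency in percent: speedup divided by the GPU count. */
double efficiency_percent(double t_single, double t_multi, int num_gpus) {
    assert(num_gpus > 0);
    return 100.0 * speedup(t_single, t_multi) / num_gpus;
}
```

For the 8-GPU A100 run (3.3022 s on 1 GPU, 0.6327 s on 8 GPUs), these helpers give a speedup of about 5.22 and an efficiency of about 65.24, matching the reported values.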

### OpenMPI Process Mappings

As we mentioned in previous labs, there are multiple ways to specify the number of processes to be run on each socket, node, etc. One such way is to use the `--map-by` option. Mapping assigns a default location to each process. To specify that we want each socket to run 4 processes, we use the `--map-by ppr:4:socket` flag. Here, `ppr` stands for processes-per-resource, where the specified resource is `socket` and the specified number of processes is `4`. It is similar to using the `-npersocket 4` option.

<mark>When launching tasks via Slurm, we use `--ntasks-per-socket` in place of both `-npersocket` and `--map-by ppr:4:socket` to specify the number of tasks to invoke on each socket.</mark> Feel free to review the list of common [Slurm flags](https://slurm.schedmd.com/mc_support.html#flags).

If using `mpirun` inside a batch script, you can verify this by using `mpirun -np 16 --map-by ppr:4:socket ./jacobi_memcpy_mpi -ny 32768` to run the executable across 2 nodes (8 tasks per node).

We can also use the `--map-by ppr:8:node:4:socket` flag with `mpirun`. This allows us to specify the number of processes per socket as well as the number of processes per node. This should result in the same execution and results. 

If using `mpirun` inside a batch script, you can verify this by using `mpirun -np 16 --map-by ppr:8:node:4:socket ./jacobi_memcpy_mpi -ny 32768` to run the executable across 2 nodes (8 tasks per node).

Example commands to run the executable using `srun` on multiple nodes (if resources permit):

- Without mapping tasks to node and socket: `srun --partition=gpu  --nodes=2 --gres=gpu:8  --ntasks=16 --mpi=pmix ./jacobi_memcpy_mpi -ny 32768`
- With mapping tasks to node and socket: `srun --partition=gpu  --nodes=2 --gres=gpu:8  --ntasks=16 --mpi=pmix --ntasks-per-socket=4 --ntasks-per-node=8 ./jacobi_memcpy_mpi -ny 32768`

If we re-run the executable with and without the `--map-by ppr:8:node:4:socket` flag and compare the results, we notice an increase in multi-node execution time and a corresponding decrease in efficiency. Check the partial results below, obtained for a $16K\times32K$ grid size over 2 nodes:

### DGX system with 8 Ampere A100
Partial results obtained from a DGX system with 8 A100s using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
# with mapping tasks to nodes and socket
Num GPUs: 16.
16384x32768: 1 GPU:   6.5668 s, 16 GPUs:   0.6528 s, speedup:    10.06, efficiency:    62.87 
```

```bash
# without mapping tasks to nodes and socket
Num GPUs: 16.
16384x32768: 1 GPU:  15.4872 s, 16 GPUs:   1.4351 s, speedup:    10.79, efficiency:    67.45 
```

### DGX system with 8 Tesla V100
Partial results obtained from a DGX system with 8 Tesla V100s using 2 nodes connected by InfiniBand (IB) NICs (16384$\times$32768 grid size):

```bash
# with mapping tasks to nodes and socket
Num GPUs: 16.
16384x32768: 1 GPU:   8.9050 s, 16 GPUs:   0.8150 s, speedup:    10.93, efficiency:    68.2
```

```bash
# without mapping tasks to nodes and socket
Num GPUs: 16.
16384x32768: 1 GPU:   8.9057 s, 16 GPUs:   0.7695 s, speedup:    11.57, efficiency:    72.34 
```

Let us check what cores or sockets or nodes each process (or MPI rank) is bound to. Binding constrains each process to run on specific processors. We use the `--report-bindings` option to check this. 

**NOTE:** To check the bindings, one can simply use `mpirun -np 16 --map-by ppr:8:node:4:socket --report-bindings ./jacobi_memcpy_mpi -ny 32768` inside a batch script to run the executable on 2 nodes (if resources are available) or use `salloc` to allocate resources and then run `mpirun`.

When we check the bindings, the output may seem cluttered. Let's look at the partial output from ranks 0 and 1:

```bash
[<node_0_name>:<proc_id>] MCW rank 0 bound to socket 0 ... [BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../../../../../../../../../../../..]
[<node_0_name>:<proc_id>] MCW rank 1 bound to socket 1 ... [../../../../../../../../../../../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
```

Rank 0 is bound to all cores on socket 0 on node 0 while rank 1 is bound to all cores on socket 1 on node 0. 


Partial example results obtained from a DGX system with 8 A100s per node using 2 nodes (using `mpirun -np 16 --map-by ppr:8:node:4:socket --report-bindings ./jacobi_memcpy_mpi -ny 32768` for 16384$\times$32768 grid size):

```bash
[dgx02:2451349] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2451349] MCW rank 1 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2451349] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2451349] MCW rank 3 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2451349] MCW rank 4 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2451349] MCW rank 5 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2451349] MCW rank 6 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2451349] MCW rank 7 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]

[dgx03:3172932] MCW rank 8 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx03:3172932] MCW rank 9 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx03:3172932] MCW rank 10 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx03:3172932] MCW rank 11 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx03:3172932] MCW rank 12 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx03:3172932] MCW rank 13 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx03:3172932] MCW rank 14 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx03:3172932] MCW rank 15 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
```

Clearly, this is not an optimal arrangement, as consecutive ranks alternate between sockets and the halo exchanges between them have to cross socket boundaries. Now, if we check the process bindings obtained with the `--map-by ppr:4:socket` option, we can see that ranks 0 and 1 are bound to the same socket in the same node. Moreover, ranks 3 and 4 are bound to different sockets (as `<procs_per_socket>` is 4) but bound to the same node, as desired.


Partial example results obtained from a DGX system with 8 A100s per node using 2 nodes (using `mpirun -np 16 --map-by ppr:4:socket --report-bindings ./jacobi_memcpy_mpi -ny 32768` for 16384$\times$32768 grid size):

```bash

[dgx01:3961004] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx01:3961004] MCW rank 1 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx01:3961004] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx01:3961004] MCW rank 3 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx01:3961004] MCW rank 4 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx01:3961004] MCW rank 5 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx01:3961004] MCW rank 6 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx01:3961004] MCW rank 7 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2165925] MCW rank 8 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2165925] MCW rank 9 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2165925] MCW rank 10 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2165925] MCW rank 11 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB][../../../..]
[dgx02:2165925] MCW rank 12 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2165925] MCW rank 13 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2165925] MCW rank 14 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
[dgx02:2165925] MCW rank 15 bound to socket 1[core 4[hwt 0-1]], socket 1[core 5[hwt 0-1]], socket 1[core 6[hwt 0-1]], socket 1[core 7[hwt 0-1]]: [../../../..][BB/BB/BB/BB]
```

It is quite easy to end up with a sub-optimal process mapping by using simple OpenMPI flags and options. Thus, it is always advisable to double-check the process-to-core and process-to-socket bindings. Moving forward, we will use the `--map-by ppr:4:socket` option, as it evidently results in the desired process-to-core, socket, and node mapping.
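
The difference between the two binding listings above can be quantified with a small model. The sketch below is plain C and assumes 2 sockets and 8 ranks per node, and that halo exchanges happen between consecutive ranks; all names are hypothetical. It counts how many neighbouring rank pairs on the same node sit on different sockets under each layout.

```c
#include <assert.h>

#define SOCKETS_PER_NODE 2
#define RANKS_PER_NODE 8

/* Node index, assuming ranks fill one node before moving to the next. */
int node_of(int rank) {
    return rank / RANKS_PER_NODE;
}

/* Block layout (as with ppr:4:socket): local ranks 0-3 on socket 0, 4-7 on socket 1. */
int socket_block(int rank) {
    int per_socket = RANKS_PER_NODE / SOCKETS_PER_NODE;
    return (rank % RANKS_PER_NODE) / per_socket;
}

/* Alternating layout (first listing): even ranks on socket 0, odd ranks on socket 1. */
int socket_alternating(int rank) {
    return (rank % RANKS_PER_NODE) % SOCKETS_PER_NODE;
}

/* Count adjacent rank pairs (r, r+1) on the same node but on different sockets. */
int cross_socket_pairs(int num_ranks, int (*socket_of)(int)) {
    int count = 0;
    for (int r = 0; r + 1 < num_ranks; ++r) {
        if (node_of(r) == node_of(r + 1) && socket_of(r) != socket_of(r + 1)) {
            ++count;
        }
    }
    return count;
}
```

For 16 ranks, the block layout yields 2 cross-socket neighbour pairs (ranks 3-4 and 11-12), while the alternating layout yields 14, since every intra-node neighbour pair straddles the two sockets.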

### Profiling

We can profile an MPI program in two ways. To profile everything, putting the data in one file:

`nsys [nsys options] mpirun [mpi options] <program>`

To profile everything putting the data from each rank into a separate file:

`mpirun [mpi options] nsys profile [nsys options] <program>`

We will use the former approach as it produces a single report and is more convenient to view. The host compute nodes need a working installation of Nsight Systems.

Let's profile the application using `nsys` (on 4 GPUs): 

In [None]:
!cd ../../source_code/mpi && sbatch profiling_4g

To view the profiler report, download and save the report file by holding down <mark>Shift</mark> and <mark>Right-Clicking</mark> [Here](../../source_code/mpi/output_profiler/jacobi_memcpy_mpi_report.nsys-rep) and choosing Save Link As. Once done, open the report via the GUI.

<!--You may notice that only 8 MPI processes are visible even though we launched 16 MPI processes. Nsight Systems displays the output from a single node and inter-node transactions (copy operations) are visible. This is for ease of viewing and doesn't impede our analysis.-->

Below is an example report (using 2 nodes):

![mpi_memcpy_overview](../../images/mpi_memcpy_overview.png)

Observe the following in the Timeline snapshot:

* Two sets of halo exchanges take place, each consisting of DtoH and HtoD CUDA Memcpy with an `MPI_Sendrecv` call in between for inter-process communication followed by an `MPI_Allreduce` call. 
* Each halo exchange takes about $45\mu$s in hardware and about $60\mu$s overall including the software overhead.
* The time between two Jacobi kernel iterations is about $200\mu$s.

However, if you scroll back in time, you might notice that not all halo exchanges take $60\mu$s. For example, here's a snapshot from near the beginning of the multi-GPU Jacobi iteration loop:

![mpi_memcpy_large_time](../../images/mpi_memcpy_large_time.png)

Here, the halo exchange takes about $1100\mu$s. MPI uses a lot of heuristics to fine-tune its call-stack and communication protocol to enhance performance. Therefore, we observe the behavior shown above where initially MPI calls take significant time but it improves in subsequent iterations.

**Solution:** The solution for this exercise is present in `source_code/mpi/solutions` directory: [jacobi_memcpy_mpi.cpp](../../source_code/mpi/solutions/jacobi_memcpy_mpi.cpp).

Note that our current implementation uses explicit host-staging for every halo copy operation. From our previous labs, we know that within a node, GPU-to-GPU communication can bypass host-staging and we implemented it using DtoD CUDA Memcpy with P2P enabled. Certainly, eliminating host-staging should improve performance. There are also inter-node communication optimizations that we can employ. 

We will learn more about both intra-node and inter-node GPU-centric MPI communication optimizations in the next lab where we will work with CUDA-aware MPI. Click below to move to the next lab:

# [Next: CUDA-aware MPI](../mpi/cuda_aware.ipynb)

Here's a link to the home notebook through which all other notebooks are accessible:

# [HOME](../../../start_here.ipynb)

---
## Links and Resources

* [Programming Concepts: MPI Point-to-Point Communication](https://cvw.cac.cornell.edu/mpip2p/p2pdef)
* [Programming Concepts: MPI Collective Communication](https://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture29.pdf)
* [Programming Concepts: NVIDIA Multi-Process Service](https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf)
* [Documentation: MPI Processing Mapping, Ranking, and Binding](https://www.open-mpi.org/doc/current/man1/mpirun.1.php#sect12)
* [Code: Multi-GPU Programming Models](https://github.com/NVIDIA/multi-gpu-programming-models)
* [Code: GPU Bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/)

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
