# Learning objectives

In this lab we will learn about:

* Multi-node multi-GPU programming and the importance of inter-process communication frameworks.
* An introduction to the MPI specification and its APIs.
* Execution of a Hello World MPI binary on a single node as well as on multiple nodes.

# Multi-Node Multi-GPU Programming

As we move from a single node to multiple nodes, the basic multi-GPU programming concepts like domain decomposition and application-specific concepts like halo exchange remain the same. However, the communication becomes complex.

A single process can spawn threads that can be spread within a node (potentially on multiple sockets) but it cannot cross the node boundary. Thus, scalable multi-node programming requires the use of multiple processes.

Inter-process communication is usually done by libraries like OpenMPI. They expose communication APIs, synchronization constructs, etc. to the user. Let us now learn about programming in MPI.

## MPI

MPI is a specification for the developers and users of message-passing libraries. By itself, it is not a library but rather a specification of what such a library should be. An example of an MPI-compliant library is OpenMPI.

It primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another process through cooperative operations on each process.

MPI is widely used in practice for HPC applications in academia, government agencies, and industry alike. This lab introduces the MPI APIs it uses, but a prior working understanding of MPI is highly desirable.

### A Hello World Example

A C-based Hello World program is shown below:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);
    // Get the number of processes
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Get the rank of the process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    // Print a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, rank, size);
    // Finalize the MPI environment.
    MPI_Finalize();
}
```

To access the program, open the [hello_world.c](../../source_code/mpi/hello_world.c) file. Alternatively, you can navigate to `CFD/English/C/source_code/mpi/` directory in Jupyter's file browser in the left pane. Then, click to open the `hello_world.c` file.

The MPI environment is initialized with `MPI_Init` through which all of MPI’s global and internal variables are constructed. A "communicator" is created between all processes that are spawned, and unique ranks are assigned to each process. 

`MPI_Comm_size` returns the size of a communicator, that is, the number of processes within that communicator. In our example, this call will return the number of processes requested for the job.

`MPI_Comm_rank` returns the rank of a process in a communicator. Each process inside of a communicator is assigned an incremental rank starting from zero. The ranks of the processes are primarily used for identification purposes when sending and receiving messages.

`MPI_Get_processor_name` obtains the name of the processor on which the process is executing and `MPI_Finalize` is used to clean up the MPI environment. No more MPI calls can be made after this call.

## Running MPI with or without containers

**We will run MPI directly on compute nodes without using containers.** The subsequent sections assume that at least 2 compute nodes with multiple GPUs in each node are available to the user. All our codes have been tested with CUDA-aware OpenMPI v4.1.1 with supporting libraries HPCX v2.8.1 (for UCX and HCOLL) and CUDA v11.3.0.0 on DGX-1 compute nodes with 8 Tesla V100 GPUs, as well as on DGX nodes with 8 Ampere A100 80GB GPUs (OpenMPI v4.1.1, HPC SDK 22.7, HPCX 2.11).

CUDA-awareness as a concept in MPI will be explained in subsequent labs.

Usually, a cluster workload manager like Slurm or PBS is present and integrated with the MPI installation to launch multi-node jobs. One way to run MPI is the `mpirun` command, assuming that the user is logged into an interactive shell with multiple nodes allocated. The other common way is to use workload manager commands like `srun` (for Slurm) directly, as they are integrated with MPI internally.

**Note:** We do outline the method to build and run containerized MPI using Singularity in tandem with host MPI implementation in our supplemental notebook: [MPI in a containerized environment](./containers_and_mpi.ipynb). **This is for your own reference only. Please do not try the container version during the event.**

### Compilation

The `mpicc` and `mpic++` (or `mpicxx`) compilers are used to compile and link programs with MPI. We can compile the Hello World program with the command:

`mpicc -o hello_world hello_world.c`

Ensure that MPI is installed (for example, if it is built from source) and available (for example, if it is loaded as a module) using the following command:

In [None]:
!mpirun --version

Expected output:

```bash
mpirun (Open MPI) 4.1.1

Report bugs to http://www.open-mpi.org/community/help/
```

Now, let us compile the program:

In [None]:
!cd ../../source_code/mpi && make clean && make hello_world

### Execution

We run the program using the `mpirun` command as follows:

`mpirun -np <procs> -npersocket <procs_per_socket> -hostfile <host_file> ./hello_world`

<mark>***NOTE: On the CURIOSITY cluster, Slurm directly launches the tasks and initializes communications through the PMIx APIs (supported by most modern MPI implementations). In other words, when using `srun`, we do not use `mpirun`; instead, we pass `--mpi=pmix`. This might be different on other machines.***</mark>

The `-np` option specifies the total number of processes spawned by the MPI runtime and the `-npersocket` option specifies the number of processes to be spawned on each socket. The `-hostfile` option allows us to specify which hosts (compute nodes) to start MPI processes on. The host file is a newline-separated list of hostnames, which must be accessible to each other so that MPI processes can communicate.

<mark>When launching tasks via Slurm, we use `--ntasks-per-socket` instead of `-npersocket` to specify the number of tasks to invoke on each socket.</mark> Feel free to review the list of common [Slurm flags](https://slurm.schedmd.com/mc_support.html#flags).

Note that DGX-1V is a dual-socket system and `<procs_per_socket>` should be less than or equal to the number of cores in each socket. Clearly, `<procs>`$\div$(`<procs_per_socket>`$\times$`<sockets_per_node>`) is the number of nodes used. There are several other options available to specify `<procs_per_socket>` that will be discussed in subsequent labs. As we are using an OpenMPI implementation in a workload manager-based environment, the `<host_file>` will be provided by Slurm and we don't need to specify this option.

There are numerous other configuration options that you can review using the `mpirun --help` command. You can check the number of sockets and cores per socket in your machine (the whole node) with the command `lscpu | grep -E 'Socket|Core'`.

### Single Node

Run the program binary on half a node:

In [None]:
!cd ../../source_code/mpi && srun --partition=gpu --nodes=1 --gres=gpu:4 --ntasks=2 --ntasks-per-node=2 --mpi=pmix --ntasks-per-socket=2 ./hello_world

You may see some warnings; as long as the output is printed, you can ignore them. The output should contain 2 unique ranks (0 and 1) along with the node's name, which is the same on both lines because both processes run on one node:

```bash
Hello world from processor <host_name>, rank 0 out of 2 processors
Hello world from processor <host_name>, rank 1 out of 2 processors
```

Example output on half a node (`--ntasks=2` and 2 tasks per socket):

```bash
Hello world from processor dgx01, rank 1 out of 2 processors
Hello world from processor dgx01, rank 0 out of 2 processors
```

Example output on full node (`--ntasks=2` and 1 task per socket):

```bash
Hello world from processor dgx03, rank 1 out of 2 processors
Hello world from processor dgx03, rank 0 out of 2 processors
```

Example output on full node (`--ntasks=4` and 2 tasks per socket):

```bash
Hello world from processor dgx02, rank 1 out of 4 processors
Hello world from processor dgx02, rank 0 out of 4 processors
Hello world from processor dgx02, rank 3 out of 4 processors
Hello world from processor dgx02, rank 2 out of 4 processors
```

### Multiple Nodes

Now, let's run the Hello World program on 2 nodes:

In [None]:
!cd ../../source_code/mpi && srun --partition=gpu  --nodes=2 --gres=gpu:8  --ntasks=4 --ntasks-per-node=2 --mpi=pmix --ntasks-per-socket=1 ./hello_world

The output, excluding warnings, should be as follows (the order of output lines is not important):

```bash
Hello world from processor <node_0_name>, rank 1 out of 4 processors
Hello world from processor <node_0_name>, rank 0 out of 4 processors
Hello world from processor <node_1_name>, rank 3 out of 4 processors
Hello world from processor <node_1_name>, rank 2 out of 4 processors
```

Example output on 2 nodes (`--ntasks=4`, 2 tasks per node, and 1 task per socket):

```bash
Hello world from processor dgx01, rank 0 out of 4 processors
Hello world from processor dgx02, rank 2 out of 4 processors
Hello world from processor dgx01, rank 1 out of 4 processors
Hello world from processor dgx02, rank 3 out of 4 processors
```

Example output on 2 nodes (`--ntasks=4`, 2 tasks per node):

```bash
Hello world from processor dgx02, rank 3 out of 4 processors
Hello world from processor dgx01, rank 2 out of 4 processors
Hello world from processor dgx01, rank 1 out of 4 processors
Hello world from processor dgx01, rank 0 out of 4 processors
```

Example output on 2 nodes (`--ntasks=16`, 8 tasks per node):

```bash
Hello world from processor dgx01, rank 0 out of 16 processors
Hello world from processor dgx01, rank 1 out of 16 processors
Hello world from processor dgx01, rank 2 out of 16 processors
Hello world from processor dgx01, rank 3 out of 16 processors
Hello world from processor dgx01, rank 4 out of 16 processors
Hello world from processor dgx01, rank 5 out of 16 processors
Hello world from processor dgx01, rank 7 out of 16 processors
Hello world from processor dgx01, rank 6 out of 16 processors
Hello world from processor dgx02, rank 8 out of 16 processors
Hello world from processor dgx02, rank 9 out of 16 processors
Hello world from processor dgx02, rank 10 out of 16 processors
Hello world from processor dgx02, rank 11 out of 16 processors
Hello world from processor dgx02, rank 14 out of 16 processors
Hello world from processor dgx02, rank 13 out of 16 processors
Hello world from processor dgx02, rank 15 out of 16 processors
Hello world from processor dgx02, rank 12 out of 16 processors
```

**Note:** Subsequent labs will assume the reader understands how to run a multi-node MPI job.

Now, let us learn more MPI concepts and code a CUDA Memcpy and MPI-based Jacobi solver. Click below to move to the next lab:

# [Next: CUDA Memcpy with MPI](../mpi/memcpy.ipynb)

Here's a link to the home notebook through which all other notebooks are accessible:

# [HOME](../../../start_here.ipynb)

---
## Links and Resources

* [Programming: MPI Hello World Tutorial](https://mpitutorial.com/tutorials/mpi-hello-world/)
* [Programming: OpenMPI Library](https://www.open-mpi.org/)
* [Concepts: Singularity Containers with MPI](https://sylabs.io/guides/3.6/user-guide/mpi.html)
* [Documentation: mpirun Command](https://www.open-mpi.org/doc/current/man1/mpirun.1.php)
* [Code: Multi-GPU Programming Models](https://github.com/NVIDIA/multi-gpu-programming-models)
* [Code: GPU Bootcamp](https://github.com/gpuhackathons-org/gpubootcamp/)

Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
