# MPI in a containerized environment

Running MPI in a containerized environment across multiple nodes requires cooperation between the MPI implementation within the container and the implementation on the host compute nodes. Moreover, a cluster is usually managed by a workload manager such as Slurm, which is integrated with the host's MPI implementation. This leaves us with several models for running containerized multi-node MPI jobs.

We will follow the approach that relies the least on the host. In this model, we can run MPI on a single node from within the container without any assistance from the host MPI. Multi-node runs require launching the job from the host nodes and are discussed in the next section.

**Please do not attempt to run the cells during the event; the instructions are for a containerized environment only.**

## Multi-node containerized MPI concepts

Multi-node execution requires that both the host nodes and the Singularity containers have compatible MPI installations that support the cluster's workload manager, such as Slurm or PBS. Specifically, our container has the following OpenMPI implementation:

* OpenMPI 4.1.1 with the following packages: 
  - HPCX 2.8.1 for UCX and HCOLL
  - CUDA 11.3.0.0
  - PMI 2 support compatible with host

All host nodes should have this implementation as well. The MPI process outside of the container will then work in tandem with MPI inside the container and the containerized MPI code to instantiate the job. The Open MPI/Singularity workflow is as follows:

1. The MPI launcher (e.g., `mpirun`, `mpiexec`) is called by the user directly from a host-based shell.
2. Open MPI then calls the process management daemon (ORTED).
3. The ORTED process launches the Singularity container requested by the launcher command.
4. Singularity instantiates the container and namespace environment.
5. Singularity then launches the MPI application within the container.
6. The MPI application launches and loads the Open MPI libraries.
7. The Open MPI libraries connect back to the ORTED process via the Process Management Interface (PMI).

![mpi_container_setup](../../images/mpi_container_setup.png)

### Building the Singularity container

Let us ensure the Singularity container is correctly built. Before using the `singularity build ...` command, the `slurm_pmi_config` directory must be filled with PMI header files and libraries copied from the host compute nodes. They are then used to build the OpenMPI implementation inside the container, so that the host's and the container's MPI can communicate with each other.

On the compute nodes, the header files are usually present in `/usr/local/include/slurm/`, `/usr/include/slurm/`, or similar directories and are named `pmi.h`, `pmix.h`, `pmi2.h`, etc. Similarly, the library `.so` files are present in `/usr/local/lib/`, `/usr/local/lib/slurm/`, or similar directories and are named `libpmi.so`, `libpmi2.so`, `mpi_pmix.so`, etc. Copy these header files and libraries to the `slurm_pmi_config/include/` and `slurm_pmi_config/lib/` directories, respectively.
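The copy step could be scripted roughly as follows. This is a minimal sketch and not part of the original lab: the function name `collect_pmi_files` is hypothetical, and the search directories and file-name patterns are assumptions taken from the typical locations listed above, so adjust them to match your cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy PMI headers and libraries found under a host root
# into <dest>/include and <dest>/lib. The directories and file patterns are
# assumptions based on the typical locations named in the text.
collect_pmi_files() {
    local root="$1"   # host filesystem root ("" on a compute node itself)
    local dest="$2"   # e.g. slurm_pmi_config
    local dir f
    mkdir -p "$dest/include" "$dest/lib"
    # Header files such as pmi.h, pmi2.h, pmix.h
    for dir in "$root/usr/local/include/slurm" "$root/usr/include/slurm"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/pmi*.h; do
            [ -e "$f" ] && cp "$f" "$dest/include/"
        done
    done
    # Libraries such as libpmi.so, libpmi2.so, mpi_pmix.so
    for dir in "$root/usr/local/lib" "$root/usr/local/lib/slurm"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/libpmi*.so* "$dir"/mpi_pmix*.so*; do
            [ -e "$f" ] && cp "$f" "$dest/lib/"
        done
    done
    return 0
}
```

On a compute node, it might be called as `collect_pmi_files "" slurm_pmi_config` before running `singularity build ...`. If your files live elsewhere, the function copies nothing, so check the contents of `slurm_pmi_config/` afterwards.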

Now, let us move on to compiling and executing the codes. We will follow the example of our [hello_world.c](../../source_code/mpi/hello_world.c) program.

## Compilation

The `mpicc` and `mpic++` (or `mpicxx`) compilers are used to compile and link programs with MPI. We can compile the `hello_world.c` program with the command:

```bash
mpicc -o hello_world hello_world.c
```

We can compile MPI programs directly from within the container using an interactive shell. In the [Makefile](../../source_code/mpi/Makefile) within MPI's source code directory, make sure to uncomment the `CUDA_HOME` variable export command.

Now, compile the program:

In [None]:
!cd ../../source_code/mpi && make clean && make hello_world

## Execution

### Single Node

Run the program binary on a single node from within the container:

In [None]:
!cd ../../source_code/mpi && mpirun -np 2 -npersocket 1 ./hello_world

You may see some warnings (like `...UCX  WARN  transport 'gdr_copy' is not available...`). As long as the output is printed, you can ignore the warnings. Optionally, you can add the `-x UCX_TLS=rc,mm,cuda_copy,cuda_ipc` flag after the `mpirun` command to suppress the warning. We will discuss the flag in subsequent labs.

In the output, you should see 2 unique ranks (0 and 1) and the node's name, as shown below:

```bash
Hello world from processor <node_name>, rank 0 out of 2 processors
Hello world from processor <node_name>, rank 1 out of 2 processors
```

To run this program across nodes (say, by using `-np 4 -npernode 2` flags), we need to do a bit of system setup which is explained below.

### Multiple Nodes

To run a multi-node MPI program, go to a login-node shell in your cluster, and:

1. Request multiple compute nodes and an interactive shell, for example with `srun -N 2 -J mpi --ntasks-per-node=8 --cores-per-socket=20 --exclusive --pty bash -i`.
2. Ensure that the correct OpenMPI (specified above) is built and/or loaded, for example by using `ompi_info --version` and `ucx_info -v`.
3. Launch Singularity using `mpirun`.

**Note:** This model assumes that the MPI on the host will internally use the workload manager (Slurm/PBS) to allocate resources. If that's not the case, the `srun` command in Slurm can be used in step 3 to launch the processes instead.
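The version check in step 2 can be automated along these lines. This is a sketch under assumptions: the function names are hypothetical, and it assumes that the first line printed by `ompi_info --version` has the form `Open MPI v4.1.1`, which may differ on your installation.

```bash
#!/usr/bin/env bash
# Extract "X.Y.Z" from text such as "Open MPI v4.1.1" read on stdin.
# The output format of `ompi_info --version` is an assumption here.
extract_ompi_version() {
    sed -n 's/.*Open MPI v\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical check: succeed only if the version read on stdin
# equals the expected version given as the first argument.
ompi_version_matches() {
    local expected="$1" actual
    actual="$(extract_ompi_version)"
    [ "$actual" = "$expected" ]
}

# Query the real tool only when it is installed on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --version | ompi_version_matches "4.1.1"; then
        echo "Open MPI 4.1.1 found"
    else
        echo "Open MPI version differs from 4.1.1" >&2
    fi
fi
```

Running the same check on the host and inside the container (via `singularity exec`) is one way to confirm that both sides report the same version.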

Launching Singularity using `mpirun` is done as follows:

```bash
mpirun -np <procs> -npernode <procs_per_node> -npersocket <procs_per_socket> \
       singularity exec --nv <singularity_image_location_on_host> <program_binary_location_on_host>
```

Let us now run the Hello World program on 2 nodes with the following command:

```bash
mpirun -np 4 -npernode 2 -npersocket 1 singularity exec <image_location_on_host> <hello_world_binary_on_host>
```

The output, excluding warnings, should be as follows (the order of output lines is not important):

```bash
Hello world from processor <node_0_name>, rank 1 out of 4 processors
Hello world from processor <node_1_name>, rank 0 out of 4 processors
Hello world from processor <node_1_name>, rank 3 out of 4 processors
Hello world from processor <node_0_name>, rank 2 out of 4 processors
```
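Because the line order varies between runs, it can be easier to check the output mechanically. The following is a minimal sketch, assuming the output lines have exactly the `rank N out of M` form shown above; the function name `ranks_complete` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: read hello_world output on stdin and succeed only if
# every rank 0..N-1 appears exactly once, in any order. Lines that do not
# contain "rank <n> out of" (e.g. UCX warnings) are ignored.
ranks_complete() {
    local expected="$1" got want
    got="$(sed -n 's/.*rank \([0-9][0-9]*\) out of.*/\1/p' | sort -n | tr '\n' ' ')"
    want="$(seq 0 $(( expected - 1 )) | tr '\n' ' ')"
    [ "$got" = "$want" ]
}
```

For example, piping the `mpirun` output into `ranks_complete 4 && echo "all ranks reported"` would confirm that all four processes started.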

#### CUDA-aware MPI programs

Now, let us compile and run a CUDA-aware MPI code on multiple nodes. We will copy the `jacobi_cuda_aware_mpi.cpp` file from the `solutions` directory into the `mpi` directory and then compile the program. Run the command below:

In [None]:
! cd ../../source_code/mpi/containerization && make jacobi_cuda_aware_mpi

Now, run the program binary with 16 processes across 2 nodes as follows:

```bash
mpirun -np 16 -npernode 8 -npersocket 4 singularity exec --nv <image_location_on_host> <jacobi_cuda_aware_mpi_binary_on_host>
```

## Profiling

We can profile an MPI program in two ways. To profile everything and put the data in one file:

```bash
nsys profile [nsys options] mpirun [mpi options] <program>
```

To profile everything and put the data from each rank into a separate file:

```bash
mpirun [mpi options] nsys profile [nsys options] <program>
```
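With the per-rank form, each process needs its own report name so the files do not overwrite each other. The sketch below prints one such command line. It assumes that the `-o` option of `nsys` expands `%q{VAR}` to the value of an environment variable and that Open MPI sets `OMPI_COMM_WORLD_RANK` for each process; verify both against your installed versions. The helper name `per_rank_profile_cmd` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a per-rank profiling command line.
# %q{OMPI_COMM_WORLD_RANK} is left unexpanded here on purpose; nsys is
# assumed to substitute it with each process's rank at run time.
per_rank_profile_cmd() {
    local np="$1" report="$2"
    shift 2
    echo "mpirun -np ${np} nsys profile --trace=mpi,cuda,nvtx -o ${report}_%q{OMPI_COMM_WORLD_RANK} $*"
}

# Example with placeholder names: 2 ranks, reports named hello_report_<rank>.
per_rank_profile_cmd 2 hello_report ./hello_world
```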

We will use the former approach as it produces a single report and is more convenient to view. Moreover, as we are running MPI inside a container, the host compute nodes need a working installation of Nsight Systems version 2020.5.1, which can be downloaded from here: [LINK](https://developer.nvidia.com/gameworksdownload#?dn=nsight-systems-2020-5-1-83).

Let's profile the CUDA-aware MPI program binary using `nsys`: 

```bash
nsys profile --trace=mpi,cuda,nvtx --stats=true --force-overwrite true -o jacobi_cuda_aware_mpi_report \
    mpirun -np 16 -npernode 8 -npersocket 4 \
        singularity exec --nv -e --env LD_LIBRARY_PATH=/opt/nvidia/nsight-systems/2020.5.1/target-linux-x64:${LD_LIBRARY_PATH} \
            <singularity_image_location_on_host> <jacobi_cuda_aware_mpi_location_on_host> -ny 32768
```

Note the following in the command-line options:
* The `nsys profile [nsys options]` part is the same as in the single-node case.
* The `mpirun [mpi options]` part is also the same as before.
* The `singularity exec` command has two additional options:
  - The `-e` flag instantiates the container with a clean environment, free from host-defined variables. However, Nsight Systems requires the `LD_PRELOAD` variable to preload its instrumentation library dynamically.
  - The `--env LD_LIBRARY_PATH=...` flag specifies the updated library path variable so that the required libraries can be loaded.
  
Download the report and view it via the GUI. You may notice that only 8 MPI processes are visible even though we launched 16 MPI processes. Nsight Systems displays the output from a single node, and inter-node transactions (copy operations) are still visible. This is for ease of viewing and doesn't impede our analysis.

Before moving on, let us restore the question code and remove the solution code:

In [None]:
!cd ../../source_code/mpi/containerization && make clean

Now that our system setup is functional, click below to go to HOME (Introduction) and begin the MPI labs:

# [HOME](../../../introduction.ipynb)

---

## Licensing
Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
 