# CUDA for Python and MPI4Py

CUDA kernels cannot make MPI calls. If you need to pass information from one GPU to another, the host program has to do it. Here's a simple example:

In [1]:
import numpy
import mpi4py.MPI as MPI
from numba import cuda

Here's a simple kernel that shifts all values in v by a:

In [2]:
@cuda.jit
def shift(a, v):
    """Shift all values in v by a.
    
    Parameters
    ----------
    a: float
        shift
    v: numpy.ndarray
        one-dimensional array with values that are to be shifted.
    """
    i = cuda.grid(1)
    if i < v.shape[0]:
        v[i] += a
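
To see why the bounds check is needed, here is a CPU-only sketch that emulates how `cuda.grid(1)` assigns one global index to each thread. The function name `emulate_shift` and the explicit loops are illustrative assumptions, not part of Numba; on a GPU the threads run in parallel rather than in a loop.

```python
import numpy


def emulate_shift(a, v, blocks, threads_per_block):
    """Apply the shift kernel on the CPU, one emulated thread at a time."""
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same value that cuda.grid(1) returns inside a 1D kernel.
            i = block_idx * threads_per_block + thread_idx
            # The last block usually has surplus threads; they must do nothing.
            if i < v.shape[0]:
                v[i] += a


v = numpy.zeros(1000)
threads_per_block = 256
# Ceiling division: enough blocks to cover every element of v.
blocks = (v.shape[0] + threads_per_block - 1) // threads_per_block
emulate_shift(-0.5, v, blocks, threads_per_block)
# Threads in the last block that fall outside the array.
surplus_threads = blocks * threads_per_block - v.shape[0]
```

With 1000 elements and 256 threads per block, 4 blocks are launched and the last 24 threads are skipped by the `if` guard. Without the guard, those threads would write past the end of the array.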

Now we set up MPI and generate random numbers on rank 0.

In [3]:
comm = MPI.COMM_WORLD
my_rank = comm.Get_rank()
number_of_ranks = comm.Get_size()
N = 1000
a_partial = numpy.empty(N)
if my_rank == 0:
    a = numpy.random.random(N * number_of_ranks)
else:
    a = numpy.empty(0)

Since we are dealing with NumPy arrays, we can use the efficient uppercase versions of the MPI calls. `Scatter` distributes an array evenly among all ranks. Note that the send buffer only needs to be allocated on rank 0, but the variable must exist on every rank.

In [4]:
comm.Scatter(a, a_partial, root = 0)
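
As a rough picture of what `Scatter` does with a contiguous array, the CPU-only sketch below splits a buffer into equal chunks, one per rank, in rank order. No MPI is involved; `number_of_ranks = 4`, the small `N`, and the `chunks` list are assumptions made only for illustration.

```python
import numpy

number_of_ranks = 4  # assumed for illustration
N = 3                # elements per rank, kept small for readability

a = numpy.arange(N * number_of_ranks, dtype=float)

# Rank r receives the r-th contiguous block of N elements.
chunks = [a[r * N:(r + 1) * N].copy() for r in range(number_of_ranks)]

for r, chunk in enumerate(chunks):
    print("rank %d receives %s" % (r, chunk))
```

In the real program, each rank sees only its own chunk in `a_partial`; the list of all chunks exists here only because everything runs in one process.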

Set up the launch configuration and call the kernel.

In [5]:
block = 256
grid = (N + block - 1) // block  # ceiling division: enough blocks to cover all N elements

shift[grid, block](-0.5, a_partial)

Gather works the opposite way to Scatter. Again, *a* only needs to have capacity on rank 0.

In [7]:
comm.Gather(a_partial, a, root = 0)

We generated a uniform distribution between 0 and 1 and then shifted it by -0.5. The mean value should now be close to zero.

In [8]:
if my_rank == 0:  # only rank 0 holds the gathered data
    print("The average of a is %.2f" % numpy.mean(a)) # Result should be near zero.
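
A CPU-only reference can be handy for checking the GPU result. The sketch below mimics the scatter, shift, gather sequence with plain NumPy; the rank count of 4 and the fixed seed are assumptions, and no MPI or CUDA calls are made.

```python
import numpy

number_of_ranks = 4  # assumed for illustration
N = 1000

rng = numpy.random.default_rng(seed=0)  # fixed seed, only for reproducibility
a = rng.random(N * number_of_ranks)

# "Scatter": one row of N values per rank.
parts = a.reshape(number_of_ranks, N).copy()

# Each "rank" shifts its own part, as the kernel does on its GPU.
for part in parts:
    part += -0.5

# "Gather": concatenate the parts in rank order.
gathered = parts.reshape(-1)

mean = numpy.mean(gathered)
print("The average of a is %.2f" % mean)
```

If the kernel ran on every rank, the mean printed by the MPI program should be similarly close to zero; a value near 0.5 suggests that the shift was never applied.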


As you can see from the example above, there's nothing special about using MPI with GPUs. The one thing that might bite you is using *multiple* MPI ranks for *multiple* GPUs on a *single* node. In this case, you might have to tell each MPI rank which GPU to use.

If you have, for example, 4 GPUs per node and you know that your scheduler chooses a compact configuration, i.e., ranks 0, 1, 2, 3 are on the first node, ranks 4, 5, 6, 7 are on the second node, etc., you can use your rank to assign a GPU to your process:

```python
cuda.select_device(my_rank % 4)  # 4 GPUs per node, ranks placed compactly
```
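
The mapping can be checked without any GPU. The sketch below only computes the device index; `device_for_rank` and `gpus_per_node` are hypothetical names, and the compact placement of ranks is an assumption about the scheduler.

```python
def device_for_rank(rank, gpus_per_node=4):
    """Return the GPU index for a rank, assuming compact rank placement."""
    return rank % gpus_per_node


# Two nodes with 4 GPUs each: ranks 0-3 on the first node, 4-7 on the second.
for rank in range(8):
    print("rank %d -> node %d, GPU %d" % (rank, rank // 4, device_for_rank(rank)))
```

In the real program, the result would be passed to `cuda.select_device`. If the placement is not compact, the global rank is not a reliable basis for the mapping; one option is to use the rank within a node-local communicator, for example one created with `comm.Split_type(MPI.COMM_TYPE_SHARED)`.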

## Exercise: Shift

a) Write the above program to a file. Remember that you can move cells with the arrow buttons above and then merge them. There's a cell magic to write a cell to a file.

b) Run the program on one node with 4 MPI processes.

c) Add the `cuda.select_device` call to the program and run it again.

## Exercise: Multi-GPU Mandelbrot

Calculate the Mandelbrot set on multiple GPUs.