In [None]:
from ipyparallel import Client
rc = Client(profile="default")
rc.ids

Heat uses PyTorch and mpi4py to enable memory-distributed array operations on multi-node (including multi-GPU) systems. Let's see what this means in practice.



In [17]:
import numpy as np
import torch

array = np.arange(60).reshape(5,4,3)
tensor = torch.arange(60).reshape(5,4,3)

tensor  

tensor([[[ 0,  1,  2],
         [ 3,  4,  5],
         [ 6,  7,  8],
         [ 9, 10, 11]],

        [[12, 13, 14],
         [15, 16, 17],
         [18, 19, 20],
         [21, 22, 23]],

        [[24, 25, 26],
         [27, 28, 29],
         [30, 31, 32],
         [33, 34, 35]],

        [[36, 37, 38],
         [39, 40, 41],
         [42, 43, 44],
         [45, 46, 47]],

        [[48, 49, 50],
         [51, 52, 53],
         [54, 55, 56],
         [57, 58, 59]]])

Heat implements numpy's API as far as possible. We can create a Heat array (officially a `DNDarray`, or distributed n-dimensional array) with the same functions that we use to create numpy arrays. We'll create a 3D DNDarray of integers ranging from 0 to 59 (5 matrices of shape (4, 3)).

In [5]:
#%%px
import heat as ht
dndarray = ht.arange(60).reshape(5,4,3)
dndarray

DNDarray([[[ 0,  1,  2],
           [ 3,  4,  5],
           [ 6,  7,  8],
           [ 9, 10, 11]],

          [[12, 13, 14],
           [15, 16, 17],
           [18, 19, 20],
           [21, 22, 23]],

          [[24, 25, 26],
           [27, 28, 29],
           [30, 31, 32],
           [33, 34, 35]],

          [[36, 37, 38],
           [39, 40, 41],
           [42, 43, 44],
           [45, 46, 47]],

          [[48, 49, 50],
           [51, 52, 53],
           [54, 55, 56],
           [57, 58, 59]]], dtype=ht.int32, device=cpu:0, split=None)

Notice the additional metadata printed with the DNDarray. Compared with a numpy ndarray, the DNDarray carries additional information on the device (in this case, the CPU) and the `split` axis. In the example above, the split axis is `None`, meaning that the DNDarray is not distributed and each MPI process has a full copy of the data.

Let's experiment with a distributed DNDarray: we'll create the same DNDarray as above, but this time distributed along the first axis.

In [None]:
%%px
dndarray = ht.arange(60, split=0).reshape(5,4,3)
dndarray

The `split` axis is now 0, meaning that the DNDarray is distributed along the first axis. Each MPI process has a slice of the data along the first axis. In order to see the data on each process, we can print the "local array" via the `larray` attribute.

In [7]:
%%px
dndarray.larray

tensor([[[ 0,  1,  2],
         [ 3,  4,  5],
         [ 6,  7,  8],
         [ 9, 10, 11]],

        [[12, 13, 14],
         [15, 16, 17],
         [18, 19, 20],
         [21, 22, 23]],

        [[24, 25, 26],
         [27, 28, 29],
         [30, 31, 32],
         [33, 34, 35]],

        [[36, 37, 38],
         [39, 40, 41],
         [42, 43, 44],
         [45, 46, 47]],

        [[48, 49, 50],
         [51, 52, 53],
         [54, 55, 56],
         [57, 58, 59]]], dtype=torch.int32)

Note that the `larray` is a `torch.Tensor` object. This is the underlying tensor that holds the data. The `dndarray` object is an MPI-aware wrapper around these process-local tensors, providing memory-distributed functionality and information.

The DNDarray can be distributed along any axis. Modify the `split` argument in the cell above to distribute the DNDarray along a different axis, and see how the `larray`s change. You'll notice that the distributed arrays are always load-balanced, meaning that the data are distributed as evenly as possible across the MPI processes.
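
To build intuition for what "load-balanced" means here, the following is a minimal pure-Python sketch, not Heat's implementation. It assumes one common scheme in which every process gets `n // p` entries along the split axis and the first `n % p` processes get one extra entry. The helper name `local_shapes` is hypothetical, and Heat's actual assignment of the remainder may differ.

```python
def local_shapes(global_shape, split, n_procs):
    """Return the per-process local shapes for a block distribution along `split`.

    Assumption (not taken from Heat): the remainder goes to the lowest ranks,
    one extra entry each.
    """
    n = global_shape[split]
    base, extra = divmod(n, n_procs)
    shapes = []
    for rank in range(n_procs):
        local = list(global_shape)
        local[split] = base + (1 if rank < extra else 0)
        shapes.append(tuple(local))
    return shapes


# The (5, 4, 3) array from this tutorial on 4 processes, for each possible split axis.
for axis in range(3):
    print(f"split={axis}: {local_shapes((5, 4, 3), axis, 4)}")
```

Under this scheme, the local sizes along the split axis never differ by more than one. With `split=2` there are only 3 entries for 4 processes, so one process holds an empty slice.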

The `DNDarray` object has a number of methods and attributes that are useful for distributed computing. In particular, it keeps track of its global and local (on a given process) shape through distributed operations and array manipulations. The DNDarray is also associated with a `comm` object, an MPI communicator through which the processes holding the DNDarray exchange data. This is what makes distributed operations possible, such as reductions, scatter, gather, and all-to-all operations.

In [13]:
%%px
print(f"Global shape of the dndarray: {dndarray.shape}")
print(f"On rank {dndarray.comm.rank}/{dndarray.comm.size}, local shape of the dndarray: {dndarray.lshape}")


Global shape of the dndarray: (5, 4, 3)
On rank 0/1, local shape of the dndarray: (5, 4, 3)


We can easily create DNDarrays from PyTorch tensors and numpy ndarrays. We can also convert DNDarrays to PyTorch tensors and numpy ndarrays. This makes it easy to integrate Heat into existing PyTorch and numpy workflows.

Finally, because the underlying data objects are PyTorch tensors, we can easily create DNDarrays on GPUs or move DNDarrays to GPUs. This allows us to perform distributed array operations on multi-GPU systems.

In this tutorial, you have a node of the JUWELS booster system available with 4 Nvidia A100 GPUs. You can create the DNDarray above on the GPUs by setting the `device` argument to "gpu". Note that Heat, like PyTorch, supports the ROCm ecosystem as well, so in principle you can also perform distributed array operations on systems with AMD GPUs.

In [None]:
%%px
dndarray = ht.arange(60, split=0, device="gpu").reshape(5,4,3)

dndarray.device


You can perform a vast number of operations on DNDarrays. Check out our [Numpy coverage tables](https://github.com/helmholtz-analytics/heat/blob/main/coverage_tables.md) to see which operations are already supported. While we are on the GPUs, let's try a matrix multiplication of two large DNDarrays.

In [16]:
%%px
n, m = 40000, 40000
x = ht.random.randn(n, m, split=None, device="gpu")
y = ht.random.randn(m, n, split=None, device="gpu")
z = %timeit -n 1  x @ y
del x, y, z

103 ms ± 694 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


You can experiment with the split parameter (distribution axis) for both matrices and time the result. Note that:
- If you set `split=None` for both matrices, each process (in this case, each GPU) will attempt to multiply the entire matrices. You will notice that the GPU memory is insufficient.
- If `split` is set (i.e. not None) for both matrices, each process will only hold a slice of the data, and will need to communicate data with other processes in order to perform the multiplication. This introduces communication overhead, but allows you to perform the multiplication on larger matrices than would fit in the memory of a single GPU.
- If `split` is None for one matrix and not None for the other, the multiplication does not require communication, and the result will be distributed. If your data size allows it, you should always favor this option.
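
The memory trade-off in the list above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is plain Python and does not call Heat. It assumes 4-byte (float32) elements, 4 processes, and an even split, and it ignores temporary buffers and communication workspace. The helper `operand_gb` is hypothetical.

```python
BYTES_PER_ELEMENT = 4  # assumption: float32 data
N_PROCS = 4            # assumption: one process per GPU on a 4-GPU node


def operand_gb(rows, cols, distributed, n_procs=N_PROCS):
    """Approximate per-process memory (in GB) held for one rows x cols matrix."""
    total_bytes = rows * cols * BYTES_PER_ELEMENT
    if distributed:
        total_bytes /= n_procs  # each process holds roughly 1/n_procs of the data
    return total_bytes / 1e9


n = m = 40000
full = operand_gb(n, m, distributed=False)
part = operand_gb(n, m, distributed=True)
print(f"split=None:     {full:.1f} GB per operand on every process")
print(f"split not None: {part:.1f} GB per operand on every process")
```

Keep in mind that the result `x @ y` has shape `(n, n)` and needs memory as well, in addition to the two operands.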

Time the multiplication for different split parameters and see how the performance changes.

You can of course perform the same operations on CPUs. 