# Heat Basics

---
## What is Heat for?



Straight from our [GitHub repository](https://github.com/helmholtz-analytics/heat):

Heat builds on [PyTorch](https://pytorch.org/) and [mpi4py](https://mpi4py.readthedocs.io) to provide high-performance computing infrastructure for memory-intensive applications within the NumPy/SciPy ecosystem.


With Heat you can:
- port existing NumPy/SciPy code from single-CPU to multi-node clusters with minimal coding effort;
- exploit the entire, cumulative RAM of your many nodes for memory-intensive operations and algorithms;
- run your NumPy/SciPy code on GPUs (CUDA, ROCm, limited support of Apple MPS).


Why?

- significant **scalability** with respect to task-parallel frameworks;
- analysis of massive datasets without breaking them up in artificially independent chunks;
- ease of use: script and test on your laptop, port straight to HPC cluster; 
- PyTorch-based: GPU support beyond the CUDA ecosystem.

<div>
  <img src=https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/source/_static/images/heatvsdask_strong_smalldata_without.png?raw=true title="Strong scaling CPU" width="30%" style="float:center"/>
  <img src=https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/source/_static/images/heatvsdask_weak_smalldata_without.png?raw=true title="Weak scaling CPU" width="30%" style="float:center "/>
  <img src=https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/source/_static/images/weak_scaling_gpu_terrabyte.png?raw=true title="Weak scaling GPU" width="30%" style="float:center"/>
</div>

## Connecting to ipyparallel cluster

We have started an `ipcluster` with 4 engines at the end of the [Setup notebook](0_setup/0_setup_local.ipynb).

Let's start the interactive session with a look into the `heat` data object. But first, we need to import the `ipyparallel` client.

In [None]:
from ipyparallel import Client
rc = Client(profile="default")
rc.ids

if len(rc.ids) == 0:
    print("No engines found")
else:
    print(f"{len(rc.ids)} engines found")

4 engines found


We will always start `heat` cells with the `%%px` magic command to execute the cell on all engines. However, the first section of this tutorial doesn't deal with distributed arrays. In these cases, we will use the `%%px --target 0` magic command to execute the cell only on the first engine.

---

## DNDarrays


Similar to a NumPy `ndarray`, a Heat `dndarray` (we'll get to the `d` later) is a grid of values of a single type. The number of dimensions is the number of axes of the array, while the shape of an array is a tuple of integers giving the number of elements of the array along each dimension.

Heat emulates NumPy's API as closely as possible, allowing for the use of well-known **array creation functions**.

In [None]:
%%px 
import heat as ht
a = ht.array([1, 2, 3])
a


Out[0:1]: DNDarray([1, 2, 3], dtype=ht.int64, device=cpu:0, split=None)



In [None]:
%%px --target 0
a = ht.ones((4, 5,))

In [None]:
%%px --target 0
ht.arange(10)

Out[0:3]: DNDarray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=ht.int32, device=cpu:0, split=None)

In [None]:
%%px --target 0
ht.full((3, 2,), fill_value=9)

Out[0:4]: 
DNDarray([[9., 9.],
          [9., 9.],
          [9., 9.]], dtype=ht.float32, device=cpu:0, split=None)

## Data Types

Heat supports various data types and operations to retrieve and manipulate the type of a Heat array. However, in contrast to NumPy, Heat is limited to logical (bool) and numerical types (uint8, int16/32/64, float32/64, and complex64/128). 

**NOTE:** by default, Heat will allocate floating-point values in single precision, due to a much higher processing performance on GPUs. This is one of the main differences between Heat and NumPy.
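To get a feeling for what the single-precision default means numerically, here is a minimal sketch that uses only the Python standard library (no Heat required). The helper `to_float32` is a hypothetical name introduced for this illustration; it rounds a Python float, which is double precision, to the nearest single-precision value.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


x = 0.1
x32 = to_float32(x)

# Storage per element: 4 bytes in single precision, 8 bytes in double precision.
print(struct.calcsize("f"), struct.calcsize("d"))

# Rounding error introduced by storing 0.1 in single precision.
print(abs(x32 - x))
```

If your application needs double precision, you can request it explicitly, for instance with `astype(ht.float64)`.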

In [None]:
%%px --target 0
a = ht.zeros((3, 4,))
a

Out[0:5]: 
DNDarray([[0., 0., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]], dtype=ht.float32, device=cpu:0, split=None)

In [None]:
%%px --target 0
b = a.astype(ht.int64)
b

Out[0:6]: 
DNDarray([[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]], dtype=ht.int64, device=cpu:0, split=None)

## Operations

Heat supports many mathematical operations, ranging from simple element-wise functions, binary arithmetic operations, and linear algebra, to more powerful reductions. Operations are by default performed on the entire array or they can be performed along one or more of its dimensions when available. Most relevant for data-intensive applications is that **all Heat functionalities support memory-distributed computation and GPU acceleration**. This holds for all operations, including reductions, statistics, linear algebra, and high-level algorithms. 

You can try out the few simple examples below if you want, but we will skip to the [Parallel Processing](#Parallel-Processing) section to see memory-distributed operations in action.

In [None]:
%%px --target 0
a = ht.full((3, 4,), 8)
b = ht.ones((3, 4,))

In [None]:
%%px --target 0
a + b

Out[0:8]: 
DNDarray([[9., 9., 9., 9.],
          [9., 9., 9., 9.],
          [9., 9., 9., 9.]], dtype=ht.float32, device=cpu:0, split=None)

In [None]:
%%px --target 0
ht.sub(a, b)

Out[0:9]: 
DNDarray([[7., 7., 7., 7.],
          [7., 7., 7., 7.],
          [7., 7., 7., 7.]], dtype=ht.float32, device=cpu:0, split=None)

In [None]:
%%px --target 0
ht.arange(5).sin()

Out[0:10]: DNDarray([ 0.0000,  0.8415,  0.9093,  0.1411, -0.7568], dtype=ht.float32, device=cpu:0, split=None)

In [None]:
%%px --target 0
a.T

Out[0:11]: 
DNDarray([[8., 8., 8.],
          [8., 8., 8.],
          [8., 8., 8.],
          [8., 8., 8.]], dtype=ht.float32, device=cpu:0, split=None)

In [None]:
%%px --target 0
b.sum(axis=1)

Out[0:12]: DNDarray([4., 4., 4.], dtype=ht.float32, device=cpu:0, split=None)

---
Heat implements the same broadcasting rules as NumPy (implicit repetition of an operation when the ranks or shapes of the operands do not match), e.g.:

In [None]:
%%px --target 0
ht.arange(10) + 3

Out[0:13]: DNDarray([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=ht.int32, device=cpu:0, split=None)

In [None]:
%%px --target 0
a = ht.ones((3, 4,))
b = ht.arange(4)
c = a + b

a, b, c

Out[0:14]: 
(DNDarray([[1., 1., 1., 1.],
          [1., 1., 1., 1.],
          [1., 1., 1., 1.]], dtype=ht.float32, device=cpu:0, split=None),
 DNDarray([0, 1, 2, 3], dtype=ht.int32, device=cpu:0, split=None),
 DNDarray([[1., 2., 3., 4.],
          [1., 2., 3., 4.],
          [1., 2., 3., 4.]], dtype=ht.float32, device=cpu:0, split=None))
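The broadcasting rule behind the examples above can be sketched as a small shape computation in plain Python. This is an illustration of the NumPy-style rule, not Heat's implementation; `broadcast_shape` is a hypothetical helper name.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the shape that results from broadcasting two shapes together."""
    result = []
    # Compare the shapes from the last axis backwards; missing axes count as 1.
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))  # (3, 4), as in a + b above
print(broadcast_shape((10,), ()))     # (10,), an array plus a scalar
```

Two axes are compatible if they are equal or if one of them is 1; any other combination raises an error.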

## Indexing

Heat allows the indexing of arrays, and thereby, the extraction of a partial view of the elements in an array. It is possible to obtain single values as well as entire chunks, i.e. slices.

In [None]:
%%px
a = ht.arange(10)
a

Out[0:15]: DNDarray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=ht.int32, device=cpu:0, split=None)

In [None]:
%%px
a[3]

Out[0:16]: DNDarray(3, dtype=ht.int32, device=cpu:0, split=None)

In [None]:
%%px
a[1:7]

Out[0:17]: DNDarray([1, 2, 3, 4, 5, 6], dtype=ht.int32, device=cpu:0, split=None)

In [None]:
%%px
a[::2]

Out[0:18]: DNDarray([0, 2, 4, 6, 8], dtype=ht.int32, device=cpu:0, split=None)

**NOTE:** Indexing in Heat is undergoing a [major overhaul](https://github.com/helmholtz-analytics/heat/pull/938), to increase interoperability with NumPy/PyTorch indexing, and to provide a fully distributed item setting functionality. Stay tuned for this feature in the next release.

## Documentation

Heat is extensively documented. You may find the online API reference on Read the Docs: [Heat Documentation](https://heat.readthedocs.io/). It is also possible to look up the docs in an interactive session.

In [None]:
%%px --target 0
help(ht.sum)

[stdout:0] Help on function sum in module heat.core.arithmetics:

sum(a: 'DNDarray', axis: 'Union[int, Tuple[int, ...]]' = None, out: 'DNDarray' = None, keepdims: 'bool' = None) -> 'DNDarray'
    Sum of array elements over a given axis. An array with the same shape as ``self.__array`` except
    for the specified axis which becomes one, e.g.
    ``a.shape=(1, 2, 3)`` => ``ht.ones((1, 2, 3)).sum(axis=1).shape=(1, 1, 3)``
    
    Parameters
    ----------
    a : DNDarray
        Input array.
    axis : None or int or Tuple[int,...], optional
        Axis along which a sum is performed. The default, ``axis=None``, will sum all of the
        elements of the input array. If ``axis`` is negative it counts from the last to the first
        axis. If ``axis`` is a tuple of ints, a sum is performed on all of the axes specified in the
        tuple instead of a single axis or all the axes as before.
    out : DNDarray, optional
        Alternative output array in which to place the result. It

## Parallel Processing

Heat's actual power lies in the possibility to exploit the processing performance of modern accelerator hardware (GPUs) as well as distributed (high-performance) cluster systems. All operations executed on CPUs are, to a large extent, vectorized (AVX) and thread-parallelized (OpenMP). Heat builds on PyTorch, so it supports GPU acceleration on Nvidia and AMD GPUs. 

For distributed computations, your system or laptop needs to have an implementation of the Message Passing Interface (MPI) installed. For GPU computations, your system needs one or more suitable GPUs and an (MPI-aware) CUDA/ROCm ecosystem.

**NOTE:** The GPU examples below will only properly execute on a computer with a GPU. Make sure to either start the notebook on an appropriate machine or copy and paste the examples into a script and execute it on a suitable device.

### GPUs

Heat's array creation functions all support an additional parameter, `device`, which places the data on a specific device. By default, the CPU is selected, but it is also possible to allocate the data directly on a GPU.

<div class="alert alert-block alert-info">
<b>The following cells will only work if you have a GPU available.</b>

</div>

In [None]:
%%px --target 0
ht.zeros((3, 4,), device='gpu')

[0:execute]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/code/heat/heat/core/devices.py:190, in sanitize_device(device)
    189 try:
--> 190     return __device_mapping[device.strip().lower()]
    191 except (AttributeError, KeyError, TypeError):

KeyError: 'gpu'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 ht.

RemoteError: [0:execute] ValueError: Unknown device, must be one of cpu
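The error above is what you get on a machine without a usable GPU. If you want the same cell to run on both kinds of machines, one option is to choose the device name at run time. The following is a minimal sketch, assuming that Heat offers the `"gpu"` device whenever PyTorch reports a CUDA or ROCm device; `pick_device` is a hypothetical helper, not part of Heat.

```python
import importlib.util


def pick_device() -> str:
    """Return "gpu" if PyTorch reports a usable CUDA/ROCm device, else "cpu"."""
    # Without PyTorch installed there is nothing to query, so fall back to the CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    # ROCm builds of PyTorch also answer through the torch.cuda namespace.
    return "gpu" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# With Heat installed, the result could be passed on, e.g.:
# ht.zeros((3, 4), device=device)
```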

Arrays on the same device can be seamlessly used in any Heat operation.

In [None]:
%%px --target 0
a = ht.zeros((3, 4,), device='gpu')
b = ht.ones((3, 4,), device='gpu')
a + b

Out[0:21]: <DNDarray(MPI-rank: 0, Shape: (3, 4), Split: None, Local Shape: (3, 4), Device: gpu:0, Dtype: float32)>

However, performing operations on arrays with mismatching devices will purposefully result in an error (due to potentially large copy overhead).

In [None]:
%%px --target 0
a = ht.full((3, 4,), 4, device='cpu')
b = ht.ones((3, 4,), device='gpu')
a + b

[0:execute]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[22], line 3
      1 a = ht.full((3, 4,), 4, device='cpu')
      2 b = ht.ones((3, 4,), device='gpu')
----> 3 a + b

File ~/code/heat/heat/core/arithmetics.py:124, in _add(self, other)
    122 def _add(self, other):
    123     try:
--> 124         return add

RemoteError: [0:execute] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

It is possible to explicitly move an array from one device to the other and back to avoid this error.

In [None]:
%%px --target 0
a = ht.full((3, 4,), 4, device='gpu')
a.cpu()

Out[0:23]: <DNDarray(MPI-rank: 0, Shape: (3, 4), Split: None, Local Shape: (3, 4), Device: cpu:0, Dtype: float32)>

We'll put our multi-GPU setup to the test in the next section.

### Distributed Computing

Heat is also able to make use of distributed processing capabilities such as those in high-performance cluster systems. For this, Heat exploits the fact that the operations performed on a multi-dimensional array are usually identical for all data items. Hence, a data-parallel processing strategy can be chosen, where the data items are divided as equally as possible among all processing nodes. Each operation is then performed individually on the local data chunks, and partial results are communicated behind the scenes if necessary. A Heat array assumes the role of a virtual overlay of the local chunks and coordinates the computations; see the figure below for a visual representation of this concept.

<img src="https://github.com/helmholtz-analytics/heat/blob/main/doc/source/_static/images/split_array.png?raw=true" width="100%"></img>

The chunks are always split along a single dimension of the array (i.e. 1-D domain decomposition). You can specify this dimension in Heat with the `split` parameter. This parameter is present in all relevant functions, such as array creation (`zeros(), ones(), ...`) or I/O (`load()`) functions.

Examples are provided below. The result of an operation on a Heat array will in most cases preserve the split of the respective operands. However, in some cases the split axis might change. For example, a transpose of a Heat array will also transpose the split axis. Furthermore, a reduction operation, e.g. `sum()`, that is performed across the split axis might remove data partitions entirely. The respective function behaviors can be found in Heat's documentation.

You may also modify the data partitioning of a Heat array by using the `resplit()` function. This allows you to repartition the data as you choose. Please note that this should be used sparingly and for small amounts of data only, as it entails significant data copying across the network. Finally, a Heat array without any split, i.e. `split=None` (default), will result in redundant copies of the data on each computation node.

On a technical level, Heat follows the so-called [Bulk Synchronous Parallel (BSP)](https://en.wikipedia.org/wiki/Bulk_synchronous_parallel) processing model. For the network communication, Heat utilizes the [Message Passing Interface (MPI)](https://computing.llnl.gov/tutorials/mpi/), a *de facto* standard on modern high-performance computing systems. It is also possible to use MPI on your laptop or desktop computer. Respective software packages are available for all major operating systems. In order to run a Heat script, you need to start it slightly differently than you are probably used to. This

```bash
python ./my_script.py
```

becomes this instead:

```bash
mpirun -n <number_of_processors> python ./my_script.py
```
On an HPC cluster, you will typically submit the job through a batch scheduler instead, e.g. with Slurm's `sbatch`.


Let's see some examples of working with distributed Heat:

In the following examples, we'll recreate the array shown in the figure, a 3-dimensional DNDarray of integers ranging from 0 to 59 (5 matrices of size (4,3)). 

In [None]:
%%px
import heat as ht
dndarray = ht.arange(60).reshape(5,4,3)
dndarray

Out[1:6]: <DNDarray(MPI-rank: 1, Shape: (5, 4, 3), Split: None, Local Shape: (5, 4, 3), Device: cpu:0, Dtype: int32)>

Out[2:6]: <DNDarray(MPI-rank: 2, Shape: (5, 4, 3), Split: None, Local Shape: (5, 4, 3), Device: cpu:0, Dtype: int32)>

Out[0:24]: <DNDarray(MPI-rank: 0, Shape: (5, 4, 3), Split: None, Local Shape: (5, 4, 3), Device: cpu:0, Dtype: int32)>

Out[3:6]: <DNDarray(MPI-rank: 3, Shape: (5, 4, 3), Split: None, Local Shape: (5, 4, 3), Device: cpu:0, Dtype: int32)>

Notice the additional metadata printed with the DNDarray. Compared to a NumPy `ndarray`, the DNDarray carries additional information on the device (in this case, the CPU) and the `split` axis. In the example above, the split axis is `None`, meaning that the DNDarray is not distributed and each MPI process has a full copy of the data.

Let's experiment with a distributed DNDarray: we'll create the same DNDarray as above, but this time distributed along the major (first) axis.

In [None]:
%%px
dndarray = ht.arange(60, split=0).reshape(5,4,3)
dndarray

Out[1:7]: <DNDarray(MPI-rank: 1, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

Out[0:25]: <DNDarray(MPI-rank: 0, Shape: (5, 4, 3), Split: 0, Local Shape: (2, 4, 3), Device: cpu:0, Dtype: int32)>

Out[3:7]: <DNDarray(MPI-rank: 3, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

Out[2:7]: <DNDarray(MPI-rank: 2, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

The `split` axis is now 0, meaning that the DNDarray is distributed along the first axis. Each MPI process has a slice of the data along the first axis. In order to see the data on each process, we can print the "local array" via the `larray` attribute.

In [None]:
%%px
dndarray.larray

Out[1:8]: 
tensor([[[24, 25, 26],
         [27, 28, 29],
         [30, 31, 32],
         [33, 34, 35]]], dtype=torch.int32)

Out[3:8]: 
tensor([[[48, 49, 50],
         [51, 52, 53],
         [54, 55, 56],
         [57, 58, 59]]], dtype=torch.int32)

Out[0:26]: 
tensor([[[ 0,  1,  2],
         [ 3,  4,  5],
         [ 6,  7,  8],
         [ 9, 10, 11]],

        [[12, 13, 14],
         [15, 16, 17],
         [18, 19, 20],
         [21, 22, 23]]], dtype=torch.int32)

Out[2:8]: 
tensor([[[36, 37, 38],
         [39, 40, 41],
         [42, 43, 44],
         [45, 46, 47]]], dtype=torch.int32)

Note that the `larray` is a `torch.Tensor` object. This is the underlying tensor that holds the data. The `dndarray` object is an MPI-aware wrapper around these process-local tensors, providing memory-distributed functionality and information.

The DNDarray can be distributed along any axis. Modify the `split` parameter when creating the DNDarray in the cell above to distribute it along a different axis, and see how the `larray`s change. You'll notice that the distributed arrays are always load-balanced, meaning that the data are distributed as evenly as possible across the MPI processes.
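The load-balanced layout can be reproduced with a few lines of plain Python. This is a sketch rather than Heat's actual code: `local_shapes` is a hypothetical helper, and it assumes that any remainder is handed to the lowest ranks, which is consistent with the local shapes printed above.

```python
def local_shapes(global_shape, split, n_procs):
    """Return the local shape on every rank for a given global shape and split axis."""
    if split is None:
        # No distribution: every process holds a full copy.
        return [tuple(global_shape)] * n_procs
    base, remainder = divmod(global_shape[split], n_procs)
    shapes = []
    for rank in range(n_procs):
        local = list(global_shape)
        # Assumption: the first `remainder` ranks receive one extra element.
        local[split] = base + (1 if rank < remainder else 0)
        shapes.append(tuple(local))
    return shapes


print(local_shapes((5, 4, 3), split=0, n_procs=4))
# [(2, 4, 3), (1, 4, 3), (1, 4, 3), (1, 4, 3)]
print(local_shapes((5, 4, 3), split=2, n_procs=4))
# [(5, 4, 1), (5, 4, 1), (5, 4, 1), (5, 4, 0)]
```

The second call shows that a process can end up with an empty chunk when the split axis has fewer elements than there are processes.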

The `DNDarray` object has a number of methods and attributes that are useful for distributed computing. In particular, it keeps track of its global and local (on a given process) shape through distributed operations and array manipulations. The DNDarray is also associated with a `comm` object, the MPI communicator.

(In MPI, the *communicator* is a group of processes that can communicate with each other. By default, the `comm` object is the `MPI.COMM_WORLD` communicator, which includes all the processes. It is used to perform collective operations, such as reductions, scatter, gather, and broadcast, as well as point-to-point communication between processes.)

In [None]:
%%px
print(f"Global shape of the dndarray: {dndarray.shape}")
print(f"On rank {dndarray.comm.rank}/{dndarray.comm.size}, local shape of the dndarray: {dndarray.lshape}")


[stdout:0] Global shape of the dndarray: (5, 4, 3)
On rank 0/4, local shape of the dndarray: (2, 4, 3)


[stdout:1] Global shape of the dndarray: (5, 4, 3)
On rank 1/4, local shape of the dndarray: (1, 4, 3)


[stdout:2] Global shape of the dndarray: (5, 4, 3)
On rank 2/4, local shape of the dndarray: (1, 4, 3)


[stdout:3] Global shape of the dndarray: (5, 4, 3)
On rank 3/4, local shape of the dndarray: (1, 4, 3)


You can perform a vast number of operations on DNDarrays distributed over multi-node and/or multi-GPU resources. Check out our [NumPy coverage tables](https://github.com/helmholtz-analytics/heat/blob/main/coverage_tables.md) to see what operations are already supported.

As mentioned above, the result of an operation on DNDarrays will in most cases preserve the `split` (distribution) axis of the respective operands, but a transpose also transposes the split axis, and a reduction performed across the split axis, e.g. `sum()`, might remove data partitions entirely. The following cells show both cases.
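The bookkeeping of the split axis can be written down as two small rules. The sketch below is plain Python with hypothetical helper names. The transpose rule and the removal of the split when reducing across it follow the text above; the index shift for reductions along an axis before the split axis is our assumption, based on NumPy-style axis removal.

```python
def split_after_transpose(split, ndim):
    """Split axis after a full axis reversal: axis i becomes axis ndim - 1 - i."""
    return None if split is None else ndim - 1 - split


def split_after_sum(split, axis):
    """Split axis after summing over `axis` (without keeping the reduced dimension)."""
    if split is None or axis is None or axis == split:
        # Reducing across the split axis (or over everything) leaves nothing to distribute.
        return None
    # Assumption: removing an axis in front of the split axis shifts the split index down by one.
    return split - 1 if axis < split else split


print(split_after_transpose(0, ndim=3))  # 2
print(split_after_sum(0, axis=0))        # None
print(split_after_sum(0, axis=1))        # 0
```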

In [None]:
%%px 
# transpose 
dndarray.T

Out[0:28]: <DNDarray(MPI-rank: 0, Shape: (3, 4, 5), Split: 2, Local Shape: (3, 4, 2), Device: cpu:0, Dtype: int32)>

Out[2:10]: <DNDarray(MPI-rank: 2, Shape: (3, 4, 5), Split: 2, Local Shape: (3, 4, 1), Device: cpu:0, Dtype: int32)>

Out[1:10]: <DNDarray(MPI-rank: 1, Shape: (3, 4, 5), Split: 2, Local Shape: (3, 4, 1), Device: cpu:0, Dtype: int32)>

Out[3:10]: <DNDarray(MPI-rank: 3, Shape: (3, 4, 5), Split: 2, Local Shape: (3, 4, 1), Device: cpu:0, Dtype: int32)>

In [None]:
%%px
# reduction operation along the distribution axis
%timeit -n 1 dndarray.sum(axis=0)


[stdout:1] The slowest run took 31.60 times longer than the fastest. This could mean that an intermediate result is being cached.
504 µs ± 876 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:2] The slowest run took 28.84 times longer than the fastest. This could mean that an intermediate result is being cached.
501 µs ± 864 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:0] The slowest run took 29.75 times longer than the fastest. This could mean that an intermediate result is being cached.
503 µs ± 880 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:3] The slowest run took 8.36 times longer than the fastest. This could mean that an intermediate result is being cached.
237 µs ± 216 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%px 
# reduction operation along non-distribution axis: no communication required
%timeit -n 1 dndarray.sum(axis=1)

[stdout:0] The slowest run took 13.43 times longer than the fastest. This could mean that an intermediate result is being cached.
114 µs ± 141 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:2] 72.7 µs ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:1] 71.7 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


[stdout:3] The slowest run took 15.67 times longer than the fastest. This could mean that an intermediate result is being cached.
183 µs ± 291 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
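Why a reduction along the split axis needs communication while a reduction along another axis does not can be illustrated without MPI. In this sketch, plain Python lists stand in for the process-local chunks of a small (5, 2) array split along axis 0; the variable and function names are made up for the illustration.

```python
# Rows held by each of 4 "processes" (rank 0 holds two rows, the others one each).
chunks = [
    [[0, 1], [2, 3]],
    [[4, 5]],
    [[6, 7]],
    [[8, 9]],
]


def local_sum_axis0(rows):
    """Column sums over the rows held by one process: only a partial result."""
    return [sum(column) for column in zip(*rows)]


def local_sum_axis1(rows):
    """Row sums over the rows held by one process: already final for those rows."""
    return [sum(row) for row in rows]


# axis=0 crosses the split axis: the partial results of all ranks must be combined,
# which is the step that requires communication in a real distributed run.
partials = [local_sum_axis0(rows) for rows in chunks]
global_axis0 = [sum(values) for values in zip(*partials)]
print(global_axis0)  # [20, 25]

# axis=1 stays within each chunk: no data from other ranks is needed.
global_axis1 = [local_sum_axis1(rows) for rows in chunks]
print(global_axis1)  # [[1, 5], [9], [13], [17]]
```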


Operations between tensors with equal split or no split are fully parallelizable and therefore very fast.

In [None]:
%%px
other_dndarray = ht.arange(60,120, split=0).reshape(5,4,3) # distributed reshape

# element-wise multiplication
dndarray * other_dndarray


Out[1:13]: <DNDarray(MPI-rank: 1, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

Out[0:31]: <DNDarray(MPI-rank: 0, Shape: (5, 4, 3), Split: 0, Local Shape: (2, 4, 3), Device: cpu:0, Dtype: int32)>

Out[3:13]: <DNDarray(MPI-rank: 3, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

Out[2:13]: <DNDarray(MPI-rank: 2, Shape: (5, 4, 3), Split: 0, Local Shape: (1, 4, 3), Device: cpu:0, Dtype: int32)>

As we saw earlier, because the underlying data objects are PyTorch tensors, we can easily create DNDarrays on GPUs or move DNDarrays to GPUs. This allows us to perform distributed array operations on multi-GPU systems.

So far we have demonstrated small, easy-to-parallelize arithmetic operations. Let's move on to linear algebra. Heat's `linalg` module supports a wide range of linear algebra operations, including matrix multiplication. Matrix multiplication is a very common operation in data analysis; it is computationally intensive and not trivial to parallelize.

With Heat, you can perform matrix multiplication on distributed DNDarrays, and the operation will be parallelized across the MPI processes. Here on 4 GPUs:

In [None]:
%%px
# free up memory if necessary
try:
    del x, y, z
except NameError:
    pass

n, m = 4000, 4000
x = ht.random.randn(n, m, split=0, device="gpu") # distributed RNG
y = ht.random.randn(m, n, split=None, device="gpu")
z = x @ y


`ht.linalg.matmul` or `@` breaks down the matrix multiplication into a series of smaller `torch` matrix multiplications, which are then distributed across the MPI processes. This operation can be very communication-intensive on huge matrices that both require distribution, and users should choose the `split` axis carefully to minimize communication overhead.

You can experiment with sizes and the `split` parameter (distribution axis) for both matrices and time the result. Note that:
- If you set **`split=None` for both matrices**, each process (in this case, each GPU) will attempt to multiply the entire matrices. Depending on the matrix sizes, the GPU memory might be insufficient. (And if you can multiply the matrices on a single GPU, it's much more efficient to stick to PyTorch's `torch.linalg.matmul` function.)
- If **`split` is not None for both matrices**, each process will only hold a slice of the data, and will need to communicate data with other processes in order to perform the multiplication. This **introduces huge communication overhead**, but allows you to perform the multiplication on larger matrices than would fit in the memory of a single GPU.
- If **`split` is None for one matrix and not None for the other**, the multiplication does not require communication, and the result will be distributed. If your data size allows it, you should always favor this option.
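To judge which of these options fits into memory, a back-of-the-envelope estimate is often enough. The following sketch computes an upper bound on the memory one process needs for its part of a 2-D array, assuming single-precision values (4 bytes each) and an even distribution along the split axis; `local_megabytes` is a hypothetical helper, not a Heat function.

```python
import math


def local_megabytes(shape, split, n_procs, bytes_per_element=4):
    """Upper bound on the memory (in MB) one process needs for its part of a 2-D array."""
    rows, cols = shape
    if split == 0:
        rows = math.ceil(rows / n_procs)
    elif split == 1:
        cols = math.ceil(cols / n_procs)
    # split=None: every process holds the full array.
    return rows * cols * bytes_per_element / 1e6


n, m, procs = 4000, 4000, 4
print(local_megabytes((n, m), None, procs))  # 64.0
print(local_megabytes((n, m), 0, procs))     # 16.0
```

For the 4000 x 4000 matrices used above, an undistributed matrix takes 64 MB on every process, while a matrix split over 4 processes takes 16 MB per process.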

Time the multiplication for different split parameters and see how the performance changes.



In [None]:
%%px
%timeit -n 1 -r 5 x @ y

[stdout:1] The slowest run took 15.33 times longer than the fastest. This could mean that an intermediate result is being cached.
2.78 ms ± 2.76 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


[stdout:2] The slowest run took 14.90 times longer than the fastest. This could mean that an intermediate result is being cached.
2.69 ms ± 2.65 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


[stdout:3] The slowest run took 14.88 times longer than the fastest. This could mean that an intermediate result is being cached.
2.22 ms ± 2.24 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


[stdout:0] The slowest run took 14.81 times longer than the fastest. This could mean that an intermediate result is being cached.
2.7 ms ± 2.66 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


Heat supports many linear algebra operations:
```text
>>> ht.linalg.
ht.linalg.basics        ht.linalg.hsvd_rtol(    ht.linalg.projection(   ht.linalg.triu(
ht.linalg.cg(           ht.linalg.inv(          ht.linalg.qr(           ht.linalg.vdot(
ht.linalg.cross(        ht.linalg.lanczos(      ht.linalg.solver        ht.linalg.vecdot(
ht.linalg.det(          ht.linalg.matmul(       ht.linalg.svdtools      ht.linalg.vector_norm(
ht.linalg.dot(          ht.linalg.matrix_norm(  ht.linalg.trace(        
ht.linalg.hsvd(         ht.linalg.norm(         ht.linalg.transpose(    
ht.linalg.hsvd_rank(    ht.linalg.outer(        ht.linalg.tril(         
```

and a lot more is in the works, including distributed eigendecompositions and SVD. If the operation you need is not yet supported, leave us a note [here](https://github.com/helmholtz-analytics/heat/issues) and we'll get back to you.

You can of course perform all operations on CPUs: simply leave out the `device` argument.

### Interoperability

We can easily create DNDarrays from PyTorch tensors and NumPy ndarrays, and convert DNDarrays back to PyTorch tensors and NumPy ndarrays. This makes it easy to integrate Heat into existing PyTorch and NumPy workflows. Here is a basic example with xarray:

In [None]:
%%px
import xarray as xr

local_xr = xr.DataArray(dndarray.larray, dims=("z", "y", "x"))
# proceed with local xarray operations
local_xr



[0:execute]
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[34], line 1
----> 1 import xarray as xr
      3 local_xr = xr.DataArray(dndarray.larray, dims=("z", "y", "x"))
      4 # proceed with local xarray operations

ModuleNotFoundError: No module named 'xarray'

AlreadyDisplayedError: 4 errors

**NOTE:** this is not a distributed `xarray`, but local xarray objects on each rank.
Work on [expanding xarray support](https://github.com/helmholtz-analytics/heat/pull/1183) is ongoing.


Heat will try to reuse the memory of the original array as much as possible. If you would prefer a copy in separate memory, the `copy` keyword argument can be used when creating a DNDarray from other libraries.

In [None]:
%%px
import torch
torch_array = torch.arange(5)
heat_array = ht.array(torch_array, copy=False)
heat_array[0] = -1
print(torch_array)

torch_array = torch.arange(5)
heat_array = ht.array(torch_array, copy=True)
heat_array[0] = -1
print(torch_array)

[stdout:0] tensor([-1,  1,  2,  3,  4])
tensor([0, 1, 2, 3, 4])


[stdout:1] tensor([-1,  1,  2,  3,  4])
tensor([0, 1, 2, 3, 4])


[stdout:2] tensor([-1,  1,  2,  3,  4])
tensor([0, 1, 2, 3, 4])


[stdout:3] tensor([-1,  1,  2,  3,  4])
tensor([0, 1, 2, 3, 4])


Interoperability is a key feature of Heat, and we are constantly working to increase Heat's compliance with the [Python array API standard](https://data-apis.org/array-api/latest/). As usual, please [let us know](https://github.com/helmholtz-analytics/heat/issues) if you encounter any issues or have any feature requests.

In the [next notebook](2_internals.ipynb), let's have a look at Heat's most important internal functions.