# Heat Tutorial
---

The original version of this tutorial was inspired by the [CS228 tutorial](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) by Volodymyr Kuleshov and Isaac Caswell.

For this interactive HPC adaptation, we have heavily referenced the [HPC Python](https://gitlab.jsc.fz-juelich.de/sdlbio-courses/hpc-python) course and the [jupyter-jsc](https://github.com/FZJ-JSC/jupyter-jsc-notebooks) repository. Many thanks to Jan Meinke, Jens Henrik Goebbert, Tim Kreuzer, and Alice Gorsch at JSC for their help setting this up.

## Introduction
---

**Table of Contents**
(copilot generated, needs to be updated)
1. [Introduction](#Introduction)
2. [Getting Started](#Getting-Started)
3. [Heat Basics](#Heat-Basics)
4. [Heat Arrays](#Heat-Arrays)
5. [Heat Operations](#Heat-Operations)


<div style="float: right; padding-right: 2em; padding-top: 2em;">
    <img src="https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/images/logo.png"></img>
</div>


FILL OUT LATER


This tutorial is designed to run on [Jupyter Notebook servers at the Jülich Supercomputing Centre](https://jupyter-jsc.fz-juelich.de/).

deRSE24 participants will have [received instructions](https://pad.gwdg.de/s2GbnPwcTWeSK-OFKs4nAw#) on how to request compute resources associated with the training. 

HERE INSERT SCREENSHOT OF THE JUPYTER-JSC HUB

Log in to the [jupyter-jsc](https://jupyter-jsc.fz-juelich.de/) hub and start a new terminal. In the terminal, copy the tutorial repository:


In general, you can install Heat easily via `pip install heat` or `conda install -c conda-forge heat`, with the main dependencies being:
 - some MPI distribution and `mpi4py`;
 - a `torch` installation suited to your hardware accelerators (if any). 
 
Installation on an HPC system is also straightforward, but it is heavily tuned to the available hardware. In this short tutorial, we skip the installation and use a dedicated kernel that we created in advance.

You do need to load the cluster modules that the kernel requires. In the terminal, type:

```bash
source /p/project/training2404/training2404.sh
```
THIS NEEDS FIXING

Now click on Select Kernel and choose the `heat1.3.1` kernel. 

Before we can test if Heat can be imported, we need to initialize the `ipcluster`. In the terminal, type:

```bash
ipcontroller  &
srun -n 4 -c 12 --ntasks-per-node 4 --time 00:30:00   -A training2404 ipengine start
```
On your terminal, you should see something like this:

```bash
FILL OUT
```

Reload the kernel. You now have access to 4 MPI processes that can be used by Heat either on CPU, or on the 4 GPUs available on each node.




## Basics: what is Heat for?
---

Straight from our [GitHub repository](https://github.com/helmholtz-analytics/heat):

Heat builds on [PyTorch](https://pytorch.org/) and [mpi4py](https://mpi4py.readthedocs.io) to provide high-performance computing infrastructure for memory-intensive applications within the NumPy/SciPy ecosystem.


With Heat you can:
- port existing NumPy/SciPy code from single-CPU to multi-node clusters with minimal coding effort;
- exploit the entire, cumulative RAM of your many nodes for memory-intensive operations and algorithms;
- run your NumPy/SciPy code on GPUs (CUDA, ROCm, coming up: Apple MPS).


### In practice

FILL IN

## Heat Arrays
---

To be able to start working with Heat on an HPC cluster, we first need to check the health of the available processes. We will use the `ipyparallel` client for this.

In [None]:
from ipyparallel import Client
rc = Client(profile="default")

We started the `ipcontroller` and 4 `ipengine` processes earlier. We can now check whether all 4 engines are available.

In [3]:
rc.ids

[0, 1, 2, 3]

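The `%px` line magic and the `%%px` cell magic from `ipyparallel` run a single line or a whole cell, respectively, on the engines instead of in the local notebook kernel.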

In [5]:
%px import heat as ht



Similar to a NumPy array, a Heat array is a grid of values that all share a single type. The number of dimensions is the number of axes of the array, while the shape of an array is a tuple of integers giving the number of elements along each dimension.

Heat emulates NumPy's API as closely as possible, allowing for the use of well-known array creation functions.

Note that, because we are running these cells on a 4-process "cluster", each print statement will be printed 4 times.

In [7]:
%%px
import heat as ht
a = ht.array([1, 2, 3])
a


[stdout:0] tensor([], device='cuda:3', dtype=torch.int64)


[stdout:2] tensor([2], device='cuda:1')


[stdout:3] tensor([1], device='cuda:0')


[stdout:1] tensor([3], device='cuda:2')


In [3]:
%%px
ht.ones((4, 5,))

DNDarray([[1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.],
          [1., 1., 1., 1., 1.]], dtype=ht.float32, device=cpu:0, split=None)

In [4]:
%%px
ht.arange(10)

DNDarray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=ht.int32, device=cpu:0, split=None)

In [5]:
%%px
ht.full((3, 2,), fill_value=9)

DNDarray([[9., 9.],
          [9., 9.],
          [9., 9.]], dtype=ht.float32, device=cpu:0, split=None)

### Data Types

Heat supports various data types and operations to retrieve and manipulate the type of a Heat array. However, in contrast to NumPy, Heat is limited to logical (bool) and numerical types (uint8, int16/32/64, float32/64, and complex64/128). 

**NOTE:** by default, Heat allocates floating-point values in single precision, because GPUs process single-precision data much faster. This is one of the main differences between Heat and NumPy.
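To get a feeling for what single precision means in practice, the following sketch uses only the Python standard library (no Heat required). The helper `to_float32` is a hypothetical name introduced for illustration; it rounds a Python float, which is double precision, to the nearest 32-bit value.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 0.5 is a power of two and survives the round trip unchanged.
print(to_float32(0.5) == 0.5)

# 0.1 has no exact binary representation; single precision keeps fewer digits of it.
print(to_float32(0.1) == 0.1)
print(abs(to_float32(0.1) - 0.1))

# Integers above 2**24 are no longer all representable in single precision.
print(to_float32(16777217.0))
```

If your application needs the extra digits, you can request double precision explicitly through the `dtype` argument, as in the cells below.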

In [6]:
%%px
a = ht.zeros((3, 4,))
a, a.dtype

(DNDarray([[0., 0., 0., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.]], dtype=ht.float32, device=cpu:0, split=None),
 heat.core.types.float32)

In [7]:
%%px
b = a.astype(ht.int64)
b, b.dtype

(DNDarray([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]], dtype=ht.int64, device=cpu:0, split=None),
 heat.core.types.int64)

In [8]:
%%px
ht.zeros((3, 4,), dtype=ht.int8)

DNDarray([[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]], dtype=ht.int8, device=cpu:0, split=None)

### Operations

Heat supports many mathematical operations, ranging from simple element-wise functions, binary arithmetic operations, and linear algebra to more powerful reductions. By default, operations act on the entire array; where supported, they can instead be applied along one or more of its dimensions.

In [9]:
%%px
a = ht.full((3, 4,), 8)
b = ht.ones((3, 4,))

In [10]:
%%px
a + b

DNDarray([[9., 9., 9., 9.],
          [9., 9., 9., 9.],
          [9., 9., 9., 9.]], dtype=ht.float32, device=cpu:0, split=None)

In [11]:
%%px
ht.sub(a, b)

DNDarray([[7., 7., 7., 7.],
          [7., 7., 7., 7.],
          [7., 7., 7., 7.]], dtype=ht.float32, device=cpu:0, split=None)

In [12]:
%%px
ht.arange(5).sin()

DNDarray([ 0.0000,  0.8415,  0.9093,  0.1411, -0.7568], dtype=ht.float32, device=cpu:0, split=None)

In [13]:
%%px
a.T

DNDarray([[8., 8., 8.],
          [8., 8., 8.],
          [8., 8., 8.],
          [8., 8., 8.]], dtype=ht.float32, device=cpu:0, split=None)

In [14]:
%%px
b.sum(axis=1)

DNDarray([4., 4., 4.], dtype=ht.float32, device=cpu:0, split=None)

---
Heat implements the same broadcasting rules (implicit repetition of an operation when the rank/shape of the operands do not match) as NumPy does, e.g.:

In [15]:
%%px
ht.arange(10) + 3

DNDarray([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=ht.int64, device=cpu:0, split=None)

In [16]:
%%px
a = ht.ones((3, 4,))
b = ht.arange(4)
c = a + b

a, b, c

(DNDarray([[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]], dtype=ht.float32, device=cpu:0, split=None),
 DNDarray([0, 1, 2, 3], dtype=ht.int32, device=cpu:0, split=None),
 DNDarray([[1., 2., 3., 4.],
           [1., 2., 3., 4.],
           [1., 2., 3., 4.]], dtype=ht.float32, device=cpu:0, split=None))
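The shape logic behind broadcasting can be sketched in a few lines of plain Python. This is an illustration of the NumPy rule (align shapes from the trailing dimension; two sizes are compatible if they are equal or one of them is 1), not Heat's implementation, and `broadcast_shape` is a hypothetical helper name.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the last axis backwards; missing axes count as size 1.
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))    # the (3, 4) + (4,) case from the cell above
print(broadcast_shape((3, 1), (1, 4)))  # both operands are stretched
```

A pair such as `(3, 4)` and `(3,)` is rejected, because the trailing sizes 4 and 3 differ and neither is 1.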

### Indexing

Heat allows the indexing of arrays, and thereby, the extraction of a partial view of the elements in an array. It is possible to obtain single values as well as entire chunks, i.e. slices.

In [17]:
%%px
a = ht.arange(10)
a

DNDarray([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=ht.int32, device=cpu:0, split=None)

In [18]:
%%px
a[3]

DNDarray(3, dtype=ht.int32, device=cpu:0, split=None)

In [19]:
%%px
a[1:7]

DNDarray([1, 2, 3, 4, 5, 6], dtype=ht.int32, device=cpu:0, split=None)

In [20]:
%%px
a[::2]

DNDarray([0, 2, 4, 6, 8], dtype=ht.int32, device=cpu:0, split=None)

**NOTE:** Indexing in Heat is undergoing a major overhaul, to increase interoperability with NumPy/PyTorch indexing, and to provide a fully distributed item setting functionality. If you're interested, you can preview the new indexing functionality by loading the `heat-dev-indexing` kernel.

### Documentation

Heat is extensively documented. You may find the online API reference on Read the Docs: [Heat Documentation](https://heat.readthedocs.io/). It is also possible to look up the docs in an interactive session.

In [21]:
%%px
help(ht.sum)

Help on function sum in module heat.core.arithmetics:

sum(a: 'DNDarray', axis: 'Union[int, Tuple[int, ...]]' = None, out: 'DNDarray' = None, keepdim: 'bool' = None) -> 'DNDarray'
    Sum of array elements over a given axis. An array with the same shape as ``self.__array`` except for the specified
    axis which becomes one, e.g. ``a.shape=(1, 2, 3)`` => ``ht.ones((1, 2, 3)).sum(axis=1).shape=(1, 1, 3)``
    
    Parameters
    ----------
    a : DNDarray
        Input array.
    axis : None or int or Tuple[int,...], optional
        Axis along which a sum is performed. The default, ``axis=None``, will sum all of the elements of the input array.
        If ``axis`` is negative it counts from the last to the first axis. If ``axis`` is a tuple of ints, a sum is performed
        on all of the axes specified in the tuple instead of a single axis or all the axes as before.
    out : DNDarray, optional
        Alternative output array in which to place the result. It must have the same shap

## Parallel Processing
---

Heat's real strength is its ability to exploit both modern accelerator hardware (GPUs) and distributed (high-performance) cluster systems. Operations executed on CPUs are, to a large extent, vectorized (AVX) and thread-parallelized (OpenMP). Because Heat builds on PyTorch, it also supports GPU acceleration on Nvidia and AMD GPUs.

For distributed computations, your system (even a laptop) needs an installed implementation of the Message Passing Interface (MPI). For GPU computations, it needs one or more suitable GPUs and an (MPI-aware) CUDA/ROCm ecosystem.

**NOTE:** The GPU examples below will only properly execute on a computer with a CUDA GPU. Make sure to either start the notebook on an appropriate machine or copy and paste the examples into a script and execute it on a suitable device.

**NOTE:** All examples below that explain the distributed processing capabilities need to be executed outside this notebook, in a separate MPI-capable environment. We suggest copying and pasting the code snippets into a script and executing it.

### GPUs

Heat's array creation functions all support an additional parameter that places the data on a specific device. By default, the CPU is selected, but it is also possible to allocate the data directly on a GPU.

In [22]:
ht.zeros((3, 4,), device='gpu')

DNDarray([[0., 0., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]], dtype=ht.float32, device=gpu:0, split=None)

Arrays on the same device can be seamlessly used in any Heat operation.

In [23]:
a = ht.zeros((3, 4,), device='gpu')
b = ht.ones((3, 4,), device='gpu')
a + b

DNDarray([[1., 1., 1., 1.],
          [1., 1., 1., 1.],
          [1., 1., 1., 1.]], dtype=ht.float32, device=gpu:0, split=None)

However, performing operations on arrays with mismatching devices will purposefully result in an error (due to potentially large copy overhead).

In [24]:
a = ht.full((3, 4,), 4, device='cpu')
b = ht.ones((3, 4,), device='gpu')
a + b

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

It is possible to explicitly move an array from one device to the other and back to avoid this error.

In [25]:
a = ht.full((3, 4,), 4, device='gpu')
a.cpu()

DNDarray([[4., 4., 4., 4.],
          [4., 4., 4., 4.],
          [4., 4., 4., 4.]], dtype=ht.float32, device=cpu:0, split=None)

When writing code for GPUs only, you might quickly find it tedious to explicitly place everything on the GPU by specifying the `device=` parameter. Hence, it is possible to set a default device on which Heat will work.

In [26]:
ht.use_device('gpu')

### Distributed Computing

Heat can also make use of distributed processing capabilities, such as those of high-performance cluster systems. For this, Heat exploits the fact that the operations performed on a multi-dimensional array are usually identical for all data items. Hence, a data-parallel processing strategy can be chosen, where the data items are divided equally among all processing nodes. Each node then performs the operation on its local data chunk and, if necessary, partial results are communicated behind the scenes. A Heat array acts as a virtual overlay over the local chunks and coordinates the computations. Please see the figure below for a visual representation of this concept.

<img src="https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/images/heat_split_array.png" width="40%"></img>
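The data-parallel idea can be mimicked in plain Python, without Heat or MPI. In this minimal sketch the "processes" are just list chunks, the final `sum` over the partial results stands in for the communication step, and `split_evenly` is a hypothetical helper, not part of Heat.

```python
def split_evenly(data, nprocs):
    """Cut `data` into `nprocs` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), nprocs)
    chunks, start = [], 0
    for rank in range(nprocs):
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(data[start:stop])
        start = stop
    return chunks


data = list(range(10))
chunks = split_evenly(data, nprocs=2)            # one chunk per simulated process
partial_sums = [sum(chunk) for chunk in chunks]  # purely local work on each chunk
total = sum(partial_sums)                        # stands in for the communication step

print(chunks)
print(partial_sums, total)
```

Each simulated process only ever touches its own chunk; only the small partial results need to be exchanged to obtain the global sum.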

The chunks are always split along a single dimension (i.e. 1D domain decomposition) of the array. You can specify this dimension in Heat by using the `split` parameter. This parameter is present in all relevant functions, such as array creation (`zeros(), ones(), ...`) or I/O (`load()`) functions. Examples are provided below. The result of an operation on a Heat array will in most cases preserve the split of the respective operands. However, in some cases the split axis might change. For example, a transpose of a Heat array will equally transpose the split axis. Furthermore, a reduction operation, e.g. `sum()`, that is performed across the split axis might remove data partitions entirely. The respective function behaviors can be found in Heat's documentation.
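As a rough mental model of 1D domain decomposition, the sketch below computes the shape of the local chunk that a given process would hold. It assumes a balanced layout in which any remainder goes to the lowest ranks; this is one common convention, and Heat's actual layout may differ in detail. `local_shape` is a hypothetical helper, not a Heat function.

```python
def local_shape(shape, split, nprocs, rank):
    """Shape of the chunk held by `rank` when `shape` is split along axis `split`."""
    if split is None:
        # No split: every process holds a full copy.
        return tuple(shape)
    base, extra = divmod(shape[split], nprocs)
    local = list(shape)
    # Assumed convention: the first `extra` ranks receive one additional element.
    local[split] = base + (1 if rank < extra else 0)
    return tuple(local)


for rank in range(2):
    print(
        rank,
        local_shape((4, 6), split=1, nprocs=2, rank=rank),
        local_shape((4, 6), split=0, nprocs=2, rank=rank),
        local_shape((4, 6), split=None, nprocs=2, rank=rank),
    )
```

Only the split axis shrinks; all other dimensions keep their global size on every process.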

You may also modify the data partitioning of a Heat array by using the `resplit()` function. This allows you to repartition the data as you choose. Please note that this should be used sparingly and for small amounts of data only, as it entails significant data copying across the network. Finally, a Heat array without any split, i.e. `split=None` (default), will result in redundant copies of the data on each computation node.
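To see why repartitioning is costly, the following plain-Python sketch counts how many elements of a `(4, 6)` array would have to change owner when going from `split=1` to `split=0` on 2 processes. It assumes equal contiguous blocks per process and is a counting exercise, not a description of how `resplit()` is implemented; `owner` is a hypothetical helper.

```python
def owner(index, length, nprocs):
    """Rank owning `index` along an axis of `length` cut into equal contiguous blocks."""
    block = length // nprocs  # assumes `length` is divisible by `nprocs`
    return index // block


rows, cols, nprocs = 4, 6, 2
moved = sum(
    1
    for i in range(rows)
    for j in range(cols)
    # owner before (split along columns) vs. owner after (split along rows)
    if owner(j, cols, nprocs) != owner(i, rows, nprocs)
)
print(f"{moved} of {rows * cols} elements change owner")
```

In this small case half of the array has to travel over the network, which is why repartitioning large arrays should be the exception.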

On a technical level, Heat follows the so-called [Bulk Synchronous Parallel (BSP)](https://en.wikipedia.org/wiki/Bulk_synchronous_parallel) processing model. For the network communication, Heat utilizes the [Message Passing Interface (MPI)](https://computing.llnl.gov/tutorials/mpi/), a de facto standard on modern high-performance computing systems. It is also possible to use MPI on your laptop or desktop computer. Respective software packages are available for all major operating systems. In order to run a Heat script, you need to start it slightly differently than you are probably used to. This

```bash
python ./my_script.py
```

becomes this instead:

```bash
mpirun -n <number_of_processors> python ./my_script.py
```

Let's see some examples of working with distributed Heat.

**NOTE:** In the following, we use a `(<processor_id>/<processor_count>)` prefix on each output to clearly show what each individual process is printing. In an actual application you would not observe this behavior.

"Unsplit" data, i.e. local copies:

In [27]:
ht.arange(10, split=None)  # equivalent to just saying ht.arange(10)

(0/2) tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=torch.int32)
(1/2) tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=torch.int32)

Data division along the major axis

In [28]:
ht.arange(10, split=0)

(0/2) tensor([0, 1, 2, 3, 4], dtype=torch.int32)
(1/2) tensor([5, 6, 7, 8, 9], dtype=torch.int32)

Other split axes are also possible

In [29]:
ht.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8]
], split=1)

(0/2) tensor([[1, 2],
(0/2)         [5, 6]]),
(1/2) tensor([[3, 4],
(1/2)         [7, 8]])

Repartitioning of the data

In [30]:
a = ht.zeros((4, 6,), split=1)
a

(0/2) tensor([[0., 0., 0.],
(0/2)         [0., 0., 0.],
(0/2)         [0., 0., 0.],
(0/2)         [0., 0., 0.]])
(1/2) tensor([[0., 0., 0.],
(1/2)         [0., 0., 0.],
(1/2)         [0., 0., 0.],
(1/2)         [0., 0., 0.]])

In [31]:
a.resplit(0)

(0/2) tensor([[0., 0., 0., 0., 0., 0.],
(0/2)         [0., 0., 0., 0., 0., 0.]])
(1/2) tensor([[0., 0., 0., 0., 0., 0.],
(1/2)         [0., 0., 0., 0., 0., 0.]])

Distributed operations

In [32]:
ht.arange(10, split=0) + 3

(0/2) tensor([3, 4, 5, 6, 7], dtype=torch.int32)
(1/2) tensor([8, 9, 10, 11, 12], dtype=torch.int32)

Operations between tensors with equal split or no split are fully parallelizable and therefore very fast.

In [33]:
a = ht.arange(10, split=0)
b = ht.ones((10,), split=0)
a + b

(0/2) tensor([1., 2., 3., 4., 5.])
(1/2) tensor([ 6.,  7.,  8.,  9., 10.])

### Parallel Interactive Interpreter

Heat allows you to interactively program and debug distributed code. The root process spawns an interactive shell that forwards the inputs to all other ranks and collects the output of all nodes. The interactive interpreter can be found in the Heat sources under `scripts/interactive.py`, or it can be downloaded with `wget https://raw.githubusercontent.com/helmholtz-analytics/heat/master/scripts/interactive.py`.

You can start the interactive interpreter by invoking the following command. The `-s all` flag must be passed for the interpreter to work.

```bash
mpirun -s all -np <procs> python interactive.py
```

**NOTE:** The interactive interpreter unfortunately does not support the full set of control commands; for example, 'arrow-up' command repetition is not available.

### Dos and Don'ts

In this section we address a few best practices for programming with Heat. While we obviously cannot cover all issues, these are the major pointers on how to get reasonable performance.

**Dos**

* Split up large data amounts
    * often your input data set, along the 'observations/samples' dimension
    * large intermediate matrices
* Use the Heat API
    * computational kernels are optimized
    * Python constructs (e.g. loops) tend to be slow
* Potentially have a copy of certain data with different splits

**Don'ts**

* Copy data extensively, e.g. through
    * operations on operands with different splits (except `None`)
    * `reshape()` calls that actually change the array dimensions (adding extra dimensions of size 1 is fine)
* Execute everything on GPU
    * computation-intensive operations are usually a good fit
    * operations extensively accessing memory only (e.g. sorting) are not