# HeAT Tutorial
---

Inspired by the [CS228 tutorial](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) by Volodymyr Kuleshov and Isaac Caswell.

## Introduction
---

**Table of Contents**

<div style="float: right; padding-right: 2em; padding-top: 2em;">
    <img src="https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/images/logo_HeAT.png"></img>
</div>

* [Installation](#Installation)
    * [Dependencies](#Dependencies)
    * [Optional Features](#Optional-Features)
* [HeAT Arrays](#HeAT-Arrays)
    * [Data Types](#Data-Types)
    * [Operations](#Operations)
    * [Indexing](#Indexing)
    * [Documentation](#Documentation)
* [Parallel Processing](#Parallel-Processing)
    * [GPUs](#GPUs)
    * [Distributed Computing](#Distributed-Computing)
    * [Parallel Interactive Interpreter](#Parallel-Interactive-Interpreter)
    * [Dos and Don'ts](#Dos-and-Don'ts)

HeAT is a flexible and seamless open-source software for high-performance data analytics and machine learning. It provides highly optimized algorithms and data structures for multi-dimensional array computations using CPUs, GPUs and distributed cluster systems. The goal of HeAT is to fill the gap between data analytics and machine learning libraries, which have a strong focus on single-node performance, and traditional high-performance computing (HPC). HeAT's generic Python-first programming interface integrates seamlessly with the existing data science ecosystem. It makes writing scalable scientific and data science applications that go beyond the computational and memory limits of your laptop or desktop as effortless as using NumPy.

For this tutorial, we assume that you are somewhat proficient in the Python programming language. It is also beneficial if you have worked with vectorized multi-dimensional array data structures before, as offered by NumPy, MATLAB or R, for example. If not, or if you feel like refreshing your knowledge, you might find the following resources useful: [CS228 Python and NumPy Tutorial](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb), [NumPy for MATLAB users](https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html) and [NumPy for R users](http://mathesaurus.sourceforge.net/r-numpy.html).

In this tutorial, we will cover the following topics:

* Installation and setup of HeAT
* Working with HeAT arrays, operations, indexing etc.
* Utilizing HeAT's scalable parallel processing capabilities

## Installation
---

In most use cases, the best way to install HeAT on your system is to use the official pre-built package from the Python Package Index (PyPI) as follows.

```bash
python -m pip install heat
```

You might need to use the `--user` flag or a [virtual environment](https://docs.python.org/3/library/venv.html) on systems where you do not have sufficient privileges.

You can also install the latest development version of HeAT by cloning the source code repository and installing it manually.

```bash
git clone https://github.com/helmholtz-analytics/heat && cd heat && pip install .
```

### Dependencies

HeAT requires an [MPI](https://computing.llnl.gov/tutorials/mpi/) installation on your system in order to enable its parallel processing capabilities. This also applies to laptops, desktops etc. If MPI is not already present, you can obtain it through your system's package manager (here: OpenMPI), e.g.:

```bash
apt-get install libopenmpi-dev  # Ubuntu, Debian
dnf install openmpi-devel       # Fedora
yum install openmpi-devel       # CentOS
```

Installing these dependencies usually requires administrator privileges.

### Optional Features

HeAT may be installed with several optional features, namely GPU support on top of CUDA as well as HDF5 and NetCDF4 (parallel) I/O. If you would like to use these features, this is how you can enable them:

* GPU support—ensure that CUDA is installed on your system. You may find an installation guide [here](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html).
* HDF5 support—install HDF5 via your system's package manager, preferably with parallel I/O capabilities

```bash
apt-get install libhdf5-openmpi-dev  # Ubuntu, Debian
dnf install hdf5-openmpi-devel       # Fedora
yum install hdf5-openmpi-devel       # CentOS
```

* NetCDF4 support—install NetCDF4 via your system's package manager, preferably with parallel I/O capabilities

```bash
apt-get install libnetcdf-dev     # Ubuntu, Debian
dnf install netcdf-openmpi-devel  # Fedora
yum install netcdf-openmpi-devel  # CentOS
```

When you install HeAT you need to explicitly state that you also want to install all modules for HDF5 and NetCDF4 support by specifying an extras flag, i.e.:

```bash
pip install 'heat[hdf5,netcdf]'
```

or, when installing from source:

```bash
git clone https://github.com/helmholtz-analytics/heat && cd heat && pip install -e '.[hdf5,netcdf]'
```

It is possible to only install either HDF5 or NetCDF4 support by leaving out the respective extra dependency in the above command.
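To check which optional features are usable in your environment, a small sketch like the following may help. It assumes that the HDF5 and NetCDF4 extras rely on the `h5py` and `netCDF4` Python packages and that MPI is accessed through `mpi4py`. These module names are assumptions about HeAT's dependencies rather than something stated in this tutorial, and the helper `available_features` is purely illustrative.

```python
import importlib.util

# Assumed mapping from optional feature to the Python module providing it.
# These module names are assumptions, not taken from this tutorial.
OPTIONAL_MODULES = {
    "mpi": "mpi4py",
    "hdf5": "h5py",
    "netcdf": "netCDF4",
}


def available_features(modules=OPTIONAL_MODULES):
    """Return a dict mapping each feature name to True if its module is importable."""
    return {
        feature: importlib.util.find_spec(module) is not None
        for feature, module in modules.items()
    }


if __name__ == "__main__":
    for feature, found in available_features().items():
        print(f"{feature:>7}: {'found' if found else 'missing'}")
```

A feature reported as missing usually means that the corresponding extra was not requested during installation or that the system library was not found when the package was built.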

## HeAT Arrays
---

To be able to start working with HeAT, we first have to import it.

In [4]:
import heat as ht

Similarly to a NumPy array, a HeAT array is a grid of values, all of identical type. The number of dimensions is the rank of the array, while the shape of an array is a tuple of integers giving the number of elements of the array along each dimension. 
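The relationship between shape and rank can be illustrated without HeAT. The following plain-Python sketch derives both from a regularly nested list; the helper `shape_of` is hypothetical and only serves to make the definitions above concrete.

```python
def shape_of(nested):
    """Return the shape of a regularly nested list as a tuple of ints."""
    shape = []
    level = nested
    # Descend into the first element of each nesting level and record its length.
    while isinstance(level, list):
        shape.append(len(level))
        if not level:
            break
        level = level[0]
    return tuple(shape)


grid = [[1, 2, 3], [4, 5, 6]]  # 2 rows, 3 columns
shape = shape_of(grid)
rank = len(shape)  # the rank is the number of dimensions
print(shape, rank)  # (2, 3) 2
```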

HeAT tries to mimic NumPy's API as closely as possible, allowing you to use well-known array creation functions.

In [5]:
ht.array([1, 2, 3])

tensor([1, 2, 3])

In [6]:
ht.ones((4, 5,))

tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

In [13]:
ht.arange(10)

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=torch.int32)

In [12]:
ht.full((3, 2,), fill_value=9)

tensor([[9., 9.],
        [9., 9.],
        [9., 9.]])

### Data Types

HeAT supports several different data types as well as operations to retrieve and manipulate the type of a HeAT array. However, in contrast to NumPy, HeAT limits itself to logical (bool) and numerical types only (uint8, int8/16/32/64 and float32/64).

**NOTE:** By default, HeAT allocates floating-point values in single precision, because single-precision arithmetic is processed much faster on GPUs.

In [15]:
a = ht.zeros((3, 4,))
a, a.dtype

(tensor([[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]), heat.core.types.float32)

In [17]:
b = a.astype(ht.int64)
b, b.dtype

(tensor([[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]), heat.core.types.int64)

In [18]:
ht.zeros((3, 4,), dtype=ht.int8)

tensor([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=torch.int8)
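The single-precision default mentioned in the note above trades accuracy for speed and memory. The following sketch uses only the Python standard library, not HeAT, to illustrate what that means: a float32 value occupies half the memory of a float64 value, and some decimal numbers are rounded differently. The helper `to_float32` is hypothetical.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Bytes per element for single and double precision.
print(struct.calcsize("<f"), struct.calcsize("<d"))  # 4 8

print(to_float32(0.5) == 0.5)  # True: 0.5 is exactly representable in both precisions
print(to_float32(0.1) == 0.1)  # False: 0.1 is rounded differently in single precision
```

If an application needs the extra precision, the data type can be requested explicitly, as shown with `astype` and the `dtype=` parameter in the cells above.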

### Operations

HeAT supports a wide range of mathematical operations, from simple element-wise functions and binary arithmetic operations to linear algebra and powerful reductions. By default, operations are performed on all values of an array; reductions can alternatively be performed along one or more of its dimensions.

In [26]:
a = ht.full((3, 4,), 8)
b = ht.ones((3, 4,))

In [27]:
a + b

tensor([[9., 9., 9., 9.],
        [9., 9., 9., 9.],
        [9., 9., 9., 9.]])

In [28]:
ht.sub(a, b)

tensor([[7., 7., 7., 7.],
        [7., 7., 7., 7.],
        [7., 7., 7., 7.]])

In [30]:
ht.arange(5).sin()

tensor([ 0.0000,  0.8415,  0.9093,  0.1411, -0.7568], dtype=torch.float64)

In [31]:
a.T

tensor([[8., 8., 8.],
        [8., 8., 8.],
        [8., 8., 8.],
        [8., 8., 8.]])

In [32]:
b.sum(axis=1)

tensor([[4.],
        [4.],
        [4.]])
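The shape of the reduction result above, `(3, 1)` for an input of shape `(3, 4)` reduced along `axis=1`, follows a simple rule: every reduced axis is kept with length one. A minimal plain-Python sketch of this rule is given below. The helper `reduced_shape` is hypothetical and mirrors only the output shown in this tutorial; other versions of HeAT may drop the reduced axis instead.

```python
def reduced_shape(shape, axis):
    """Shape after a reduction that keeps every reduced axis with length one.

    `axis` is an int or a tuple of ints; negative values count from the last axis.
    """
    ndim = len(shape)
    if isinstance(axis, int):
        axis = (axis,)
    reduced = set()
    for ax in axis:
        if not -ndim <= ax < ndim:
            raise ValueError(f"axis {ax} is out of bounds for {ndim} dimensions")
        reduced.add(ax % ndim)  # normalize negative axes
    return tuple(1 if i in reduced else n for i, n in enumerate(shape))


print(reduced_shape((3, 4), axis=1))     # (3, 1)
print(reduced_shape((1, 2, 3), axis=1))  # (1, 1, 3)
print(reduced_shape((3, 4), axis=-2))    # (1, 4)
```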

---
HeAT implements the same broadcasting rules as NumPy, i.e. the implicit repetition of an operation when the ranks or shapes of the operands do not match, e.g.:

In [21]:
ht.arange(10) + 3

tensor([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=torch.int32)

In [24]:
a = ht.ones((3, 4,))
b = ht.arange(4)
c = a + b

a, b, c

(tensor([[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]),
 tensor([0, 1, 2, 3], dtype=torch.int32),
 tensor([[1., 2., 3., 4.],
         [1., 2., 3., 4.],
         [1., 2., 3., 4.]]))
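The broadcasting rule itself can be written down in a few lines. The sketch below implements NumPy's shape-matching rule in plain Python: shapes are compared from the trailing dimension, and two dimensions are compatible if they are equal or one of them is 1. The function name `broadcast_shape` is illustrative and not part of HeAT.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the result shape of broadcasting two operand shapes (NumPy rules)."""
    result = []
    # Compare dimensions from the trailing end; missing dimensions count as 1.
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))  # (3, 4), as in the a + b example above
print(broadcast_shape((10,), ()))     # (10,), an array plus a scalar
```

Shapes such as `(3, 4)` and `(3,)` are rejected, because their trailing dimensions (4 and 3) neither match nor equal 1.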

### Indexing

HeAT allows you to index arrays and thereby extract a partial view of the elements in an array. It is possible to obtain single values as well as entire chunks, called slices.

In [34]:
a = ht.arange(10)
a

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=torch.int32)

In [35]:
a[3]

tensor(3, dtype=torch.int32)

In [36]:
a[1:7]

tensor([1, 2, 3, 4, 5, 6], dtype=torch.int32)

In [37]:
a[::2]

tensor([0, 2, 4, 6, 8], dtype=torch.int32)
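The slices used above follow Python's standard `start:stop:step` semantics. As a small illustration, independent of HeAT, the following sketch expands a slice into the concrete indices it selects, using the built-in `slice.indices` method. The helper `slice_to_indices` is hypothetical.

```python
def slice_to_indices(s, length):
    """Expand a slice into the concrete indices it selects from a sequence of `length`."""
    start, stop, step = s.indices(length)  # resolves None and negative values
    return list(range(start, stop, step))


print(slice_to_indices(slice(1, 7), 10))           # [1, 2, 3, 4, 5, 6]
print(slice_to_indices(slice(None, None, 2), 10))  # [0, 2, 4, 6, 8]
print(slice_to_indices(slice(-3, None), 10))       # [7, 8, 9]
```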

### Documentation

HeAT is extensively documented. You may find the online API reference on Read the Docs: [HeAT Documentation](https://heat.readthedocs.io/). It is also possible to look up the docs in an interactive session.

In [20]:
help(ht.sum)

Help on function sum in module heat.core.arithmetics:

sum(x, axis=None, out=None, keepdim=None)
    Sum of array elements over a given axis.
    
    Parameters
    ----------
    x : ht.DNDarray
        Input data.
    axis : None or int or tuple of ints, optional
        Axis along which a sum is performed. The default, axis=None, will sum
        all of the elements of the input array. If axis is negative it counts
        from the last to the first axis.
    
        If axis is a tuple of ints, a sum is performed on all of the axes specified
        in the tuple instead of a single axis or all the axes as before.
    
    Returns
    -------
    sum_along_axis : ht.DNDarray
        An array with the same shape as self.__array except for the specified axis which
        becomes one, e.g. a.shape = (1, 2, 3) => ht.ones((1, 2, 3)).sum(axis=1).shape = (1, 1, 3)
    
    Examples
    --------
    >>> ht.sum(ht.ones(2))
    tensor([2.])
    
    >>> ht.sum(ht.ones((3,3)))
    tensor([9.

## Parallel Processing
---

HeAT's actual power lies in its ability to exploit the processing performance of modern accelerator hardware (GPUs) as well as distributed (high-performance) cluster systems. On CPUs, all operations are already vectorized (AVX) and thread-parallelized (OpenMP) to a large extent. We utilize CUDA to process data on GPUs, which requires a suitable NVIDIA device, and the Message Passing Interface (MPI) for distributed computations.

**NOTE:** The GPU examples below will only properly execute on a computer with a CUDA GPU. Make sure to either start the notebook on an appropriate machine or copy and paste the examples into a script and execute it on a suitable device.

**NOTE:** All examples below that explain the distributed processing capabilities need to be executed outside this notebook in a separate MPI-capable environment. We suggest copying and pasting the code snippets into a script and executing it.

### GPUs

HeAT's array creation functions all support an additional parameter that allows you to place the data on a specific device. By default, the CPU is selected, but it is also possible to directly allocate the data on a GPU.

In [38]:
ht.zeros((3,4,), device='gpu')

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0')

Arrays on the same device can be seamlessly used in any HeAT operation.

In [39]:
a = ht.zeros((3,4,), device='gpu')
b = ht.ones((3,4,), device='gpu')
a + b

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], device='cuda:0')

However, performing operations on arrays with mismatching devices will purposefully result in an error (due to the potentially large copy overhead).

In [40]:
a = ht.full((3,4,), 4, device='cpu')
b = ht.ones((3,4,), device='gpu')
a + b

RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor

It is possible, though, to explicitly move an array from one device to the other and back.

In [41]:
a = ht.full((3,4,), 4, device='gpu')
a.cpu()

tensor([[4., 4., 4., 4.],
        [4., 4., 4., 4.],
        [4., 4., 4., 4.]])

When writing code for GPUs only, you might quickly find it tedious to explicitly place every array on the GPU by specifying the `device=` parameter. Hence, it is possible to set a default device on which HeAT will operate.

In [None]:
ht.use_device('gpu')

### Distributed Computing



<img src="https://raw.githubusercontent.com/helmholtz-analytics/heat/master/doc/images/heat_split_array.png" width="40%"></img>

**NOTE:** In the following, we will use a `(<mpi_rank>/<mpi_size>)` prefix on each output to clearly show what each individual process is printing. In an actual application you would not observe this behavior.
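To give an impression of what splitting an array across processes means, the following sketch simulates the idea in a single Python process, without MPI or HeAT. It assumes that the data is divided into contiguous blocks and that any remainder is assigned to the lowest ranks. This distribution rule and the helper `chunk` are assumptions for illustration and may differ from what HeAT does internally. The output uses the `(<mpi_rank>/<mpi_size>)` prefix convention described above.

```python
def chunk(length, size, rank):
    """Return (offset, count) of the block owned by `rank` out of `size` processes.

    Assumption: blocks are contiguous and the remainder goes to the lowest ranks.
    """
    base, remainder = divmod(length, size)
    count = base + (1 if rank < remainder else 0)
    offset = rank * base + min(rank, remainder)
    return offset, count


data = list(range(10))
size = 4  # simulated number of MPI processes

for rank in range(size):
    offset, count = chunk(len(data), size, rank)
    print(f"({rank}/{size}) {data[offset:offset + count]}")
```

In this simulation each "process" only holds its own block of the data, which is what allows a distributed array to exceed the memory of a single machine.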

### Parallel Interactive Interpreter

HeAT allows you to interactively program and debug distributed code. The root process spawns an interactive shell that forwards the inputs to all other ranks and collects the output of all nodes. The interactive interpreter can be found in the HeAT sources under `scripts/interactive.py`, or it can be downloaded with `wget https://raw.githubusercontent.com/helmholtz-analytics/heat/master/scripts/interactive.py`.

You can start the interactive interpreter by invoking the following command. The `-s all` flag must be passed to `mpirun` for the interpreter to work.

```bash
mpirun -s all -np <procs> python interactive.py
```

**NOTE:** The interactive interpreter unfortunately does not support the full set of control commands; for example, repeating the previous command with the 'arrow-up' key is not possible.

### Dos and Don'ts

In this section, we would like to address a few best practices for programming with HeAT. While we obviously cannot cover all issues, these are the major pointers on how to achieve reasonable performance.

**Dos**

* Split up large amounts of data
    * often your input data set, along the 'observations/samples' dimension
    * large intermediate matrices
* Use the HeAT API
    * computational kernels are optimized
    * Python constructs (e.g. loops) tend to be slow

**Don'ts**

* Copy data extensively, e.g. through
    * operations with operands of different splits (except None)
    * reshape() calls that actually change the array dimensions (adding extra dimensions of size 1 is fine)
* Execute everything on GPU
    * computation-intensive operations are usually a good fit
    * operations extensively accessing memory only (e.g. sorting) are not
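The advice to prefer optimized kernels over Python loops can be tried out with the standard library alone. The following sketch compares an explicit Python loop with the built-in `sum`, which serves here as a stand-in for an optimized library kernel; it does not use HeAT. Timings depend on the machine, so no particular numbers are claimed.

```python
import timeit


def loop_sum(values):
    """Sum values with an explicit Python loop."""
    total = 0
    for value in values:
        total += value
    return total


data = list(range(100_000))

# Both variants compute the same result ...
assert loop_sum(data) == sum(data)

# ... but the built-in runs its loop in compiled code.
loop_time = timeit.timeit(lambda: loop_sum(data), number=20)
builtin_time = timeit.timeit(lambda: sum(data), number=20)
print(f"explicit loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```

The same reasoning applies to array libraries: a single call that processes a whole array typically outperforms an element-by-element loop written in Python.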