Using PyTorch as compute engine and mpi4py for communication, Heat implements a number of machine learning algorithms that are optimized for memory-distributed data volumes. This allows you to tackle datasets that are too large for single-node (or worse, single-GPU) processing. 

As opposed to task-parallel frameworks, Heat takes a data-parallel approach, meaning that each "worker" or MPI process performs the same tasks on different slices of the data, and communication between processes is performed transparently when the task requires it. 

In other words: 
- you don't have to worry about optimizing data chunk sizes; 
- you don't have to make sure your research problem is embarrassingly parallel, or artificially shrink your dataset so that it fits into your RAM; 
- you do have to make sure that you have sufficient **overall** RAM to run your global task (e.g. by requesting enough nodes / GPUs).
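
To make the data-parallel idea concrete, here is a minimal pure-Python sketch that simulates it on a single machine. It is not how Heat is implemented: the helper `local_slice`, the worker count and the toy data are hypothetical and only illustrate that every worker runs the same code on its own slice and that partial results are combined afterwards.

```python
# Simulate a data-parallel mean: every "worker" runs the same code on its own slice.
data = [float(i) for i in range(10)]   # the global data set (toy example)
n_workers = 3                          # hypothetical number of MPI processes


def local_slice(rank, n_workers, n_items):
    """Return the (start, stop) index range owned by a worker (contiguous, near-equal split)."""
    base, extra = divmod(n_items, n_workers)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Each worker computes a partial sum and a count on its own slice only.
partials = []
for rank in range(n_workers):
    start, stop = local_slice(rank, n_workers, len(data))
    chunk = data[start:stop]
    partials.append((sum(chunk), len(chunk)))

# The "communication" step: combine the partial results
# (in MPI terms this would be a reduction across processes).
global_sum = sum(s for s, _ in partials)
global_count = sum(c for _, c in partials)
global_mean = global_sum / global_count
print(global_mean)
```

In Heat, the slicing and the reduction happen behind the scenes, which is why you do not have to choose chunk sizes yourself.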

The following shows a few examples. We'll use "small" datasets here, as each of us has access to only one node.

# Loading and Preprocessing

Let's start by loading a data set. This particular example data set (generated from all asteroids in the JPL Small Body Database) is really small, but it allows us to demonstrate the basic functionality of Heat. Note that HDF5 (`.h5`) is a typical file format in HPC since it allows for efficient parallel I/O. 

In [None]:
# load the dataset "data" from the HDF5 file, distributed along the rows (split=0), onto the GPU
X = ht.load_hdf5("data/sbdb_asteroids.h5", "data", split=0, device="gpu")

To get a first overview, we can print the data and determine its feature-wise mean, variance, min, max etc. 

In [None]:
print(X)

print(ht.mean(X,axis=0))
print(ht.var(X,axis=0))
print(ht.max(X,axis=0))
print(ht.min(X,axis=0))
print(ht.percentile(X,50.,axis=0))

Next, we can preprocess the data, e.g., by standardizing and/or normalizing. Heat offers several preprocessing routines for doing so, similar to the ones known from scikit-learn. 

In [None]:
scaler = ht.preprocessing.StandardScaler()
X_standardized = scaler.fit_transform(X)
print(ht.mean(X_standardized,axis=0))
print(ht.var(X_standardized,axis=0))

scaler = ht.preprocessing.RobustScaler()
X_robust = scaler.fit_transform(X)
print(ht.median(X_robust,axis=0))
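
For reference, the following single-node NumPy sketch shows what the two scalings compute. It is an illustration under assumptions, not Heat's implementation: `X_demo` is hypothetical stand-in data, and the robust variant shown here (median and interquartile range) is one common choice whose defaults may differ from those of Heat's scalers.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # hypothetical stand-in data

# Standardization: subtract the feature-wise mean, divide by the feature-wise standard deviation.
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)
X_std = (X_demo - mean) / std

# Robust scaling (one common variant): subtract the median, divide by the interquartile range.
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
X_rob = (X_demo - median) / (q75 - q25)

# After standardization each feature has mean ~0 and variance ~1;
# after robust scaling each feature has median ~0.
print(X_std.mean(axis=0), X_std.var(axis=0))
print(np.median(X_rob, axis=0))
```

Robust scaling relies on the median and quantiles rather than on the mean and variance, which makes it less sensitive to outliers.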


# Clustering


# Truncated SVD

### SVD and its truncated counterparts in a nutshell 
Let $X \in \mathbb{R}^{m \times n}$ be a matrix, e.g., given by a data set consisting of $m$ data points $\in \mathbb{R}^n$ stacked together. The so-called **singular value decomposition (SVD)** of $X$ is given by 

$$
	X = U \Sigma V^T
$$

where $U \in \mathbb{R}^{m \times r_X}$ and $V \in \mathbb{R}^{n \times r_X}$ have orthonormal columns, $\Sigma = \text{diag}(\sigma_1,\ldots,\sigma_{r_X}) \in \mathbb{R}^{r_X \times r_X}$ is a diagonal matrix containing the so-called singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_{r_X} > 0$, and $r_X \leq \min(m,n)$ denotes the rank of $X$ (i.e. the dimension of the subspace of $\mathbb{R}^m$ spanned by the columns of $X$). Since $\Sigma = U^T X V$ is diagonal, one can think of this decomposition as finding orthogonal coordinate transformations under which $X$ becomes diagonal, i.e. acts as a pure scaling along the coordinate axes. 

In data science, SVD is more often known as **principal component analysis (PCA)**, which amounts to an SVD of the mean-centered data, the columns of $U$ being called the principal components of $X$. In fact, in many applications **truncated SVD/PCA** suffices: to reduce $X$ to the "essential" information, one chooses a truncation rank $0 < r \leq r_X$ and considers the truncated SVD/PCA given by 

$$
X \approx X_r := U_{[:,:r]} \Sigma_{[:r,:r]} V_{[:,:r]}^T
$$

where we have used `numpy`-like notation for selecting only the first $r$ columns of $U$ and $V$, respectively. The rationale behind this is that if the first $r$ singular values of $X$ are much larger than the remaining ones, $X_r$ will still contain all "essential" information contained in $X$; in mathematical terms: 

$$
\lVert X_r - X \rVert_{F}^2 = \sum_{i=r+1}^{r_X} \sigma_i^2, 
$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. Thus, truncated SVD/PCA may be used, e.g., to 
* filter away non-essential information in order to get a "feeling" for the main characteristics of your data set, 
* detect linear (or "almost" linear) dependencies in your data, 
* generate features for further processing of your data. 
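
The error identity above can be checked numerically. The following single-node NumPy sketch (with a hypothetical random matrix `X_demo` and an arbitrarily chosen truncation rank) builds $X_r$ from a full SVD and compares the squared Frobenius error with the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.standard_normal((60, 40))  # hypothetical example matrix

# Full (thin) SVD: X_demo = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Rank-r truncation: keep only the first r singular triplets.
r = 10
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Squared Frobenius error versus the sum of the discarded squared singular values.
err_sq = np.linalg.norm(X_r - X_demo, "fro") ** 2
tail_sq = np.sum(s[r:] ** 2)
print(err_sq, tail_sq)
```

Both printed numbers agree up to floating-point rounding, so a fast decay of the singular values translates directly into a small truncation error.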

Moreover, there are plenty of more advanced data analytics and data-based simulation techniques, such as Proper Orthogonal Decomposition (POD) or Dynamic Mode Decomposition (DMD), that are based on SVD/PCA. 


In Heat we have currently implemented an algorithm for computing an approximate truncated SVD, where truncation is performed either w.r.t. a fixed truncation rank (`heat.linalg.hsvd_rank`) or w.r.t. a desired accuracy (`heat.linalg.hsvd_rtol`). In the latter case, the algorithm ensures that the relative "reconstruction error" satisfies 

$$
\frac{\lVert X - U U^T X \rVert_F}{\lVert X \rVert_F} \overset{!}{\leq} \text{rtol},
$$

where $U$ denotes the approximate left-singular vectors of $X$ computed by `heat.linalg.hsvd_rtol`. 
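
To illustrate what truncating w.r.t. a relative tolerance means, here is a single-node NumPy sketch based on an exact SVD. It is not the algorithm used by `heat.linalg.hsvd_rtol`; the test matrix `X_demo` and all variable names are assumptions made for the example. The sketch picks the smallest rank whose relative reconstruction error does not exceed `rtol` and then measures that error directly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical test matrix with rapidly decaying singular values.
X_demo = (
    rng.standard_normal((80, 30))
    @ np.diag(0.5 ** np.arange(30))
    @ rng.standard_normal((30, 50))
)

U, s, _ = np.linalg.svd(X_demo, full_matrices=False)
rtol = 2.5e-2

# tail[k] = sqrt(sum of s_i^2 for i >= k); keeping r singular values leaves an error of tail[r].
tail = np.sqrt(np.cumsum(s[::-1] ** 2)[::-1])
rel_err = np.append(tail[1:], 0.0) / tail[0]  # rel_err[r - 1]: relative error for rank r

# Smallest rank whose relative error is within the tolerance.
r = int(np.argmax(rel_err <= rtol)) + 1
U_r = U[:, :r]

# Measure the reconstruction error ||X - U U^T X||_F / ||X||_F directly.
measured = np.linalg.norm(X_demo - U_r @ (U_r.T @ X_demo), "fro") / np.linalg.norm(X_demo, "fro")
print("rank:", r, "relative error:", measured)
```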


In [None]:
# generate a random 500 x (1500 * nprocs) matrix with rank 100 on GPU 
nprocs = ht.MPI_WORLD.size  # number of MPI processes (not the rank of the current process)
# random_known_rank returns the matrix together with its SVD factors; keep only the matrix
X = ht.utils.data.matrixgallery.random_known_rank(500, 1500 * nprocs, 100, split=1, dtype=ht.float32, device="gpu")[0]

# compute truncated SVD w.r.t. relative tolerance 
# with compute_sv=True the result is (U, sigma, V, relative error estimate)
svd_with_reltol = ht.linalg.hsvd_rtol(X, rtol=2.5e-2, compute_sv=True, silent=False)
print("relative residual:", svd_with_reltol[3], "rank: ", svd_with_reltol[0].shape[1])

# compute truncated SVD w.r.t. a fixed truncation rank 
svd_with_rank = ht.linalg.hsvd_rank(X, maxrank=50, compute_sv=True, silent=False)
print("relative residual:", svd_with_rank[3], "rank: ", svd_with_rank[0].shape[1])
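
The idea behind a hierarchical SVD, as described in the references below, can be sketched on a single node with NumPy. This is a simplified one-level illustration under assumptions, not Heat's implementation: the helper `truncated_left_factor`, the block count and the exactly low-rank test matrix `X_demo` are hypothetical. Each (simulated) process compresses its own column block with a local truncated SVD, and the compressed factors are then merged by one more SVD.

```python
import numpy as np

rng = np.random.default_rng(7)
rank = 5
# Hypothetical test matrix of exact rank 5.
X_demo = rng.standard_normal((50, rank)) @ rng.standard_normal((rank, 120))


def truncated_left_factor(block, r):
    """Return U[:, :r] * s[:r], a compressed representation of the block's column space."""
    U, s, _ = np.linalg.svd(block, full_matrices=False)
    return U[:, :r] * s[:r]


# "Leaves": each simulated process compresses its own column block.
blocks = np.array_split(X_demo, 4, axis=1)
compressed = [truncated_left_factor(b, rank) for b in blocks]

# "Merge": stack the compressed factors side by side and compress once more.
U_merged, s_merged, _ = np.linalg.svd(np.hstack(compressed), full_matrices=False)
U_h = U_merged[:, :rank]

# Compare with a direct SVD of the full matrix.
rel_err = np.linalg.norm(X_demo - U_h @ (U_h.T @ X_demo), "fro") / np.linalg.norm(X_demo, "fro")
s_direct = np.linalg.svd(X_demo, compute_uv=False)[:rank]
print("relative error:", rel_err)
print("singular values match:", np.allclose(s_merged[:rank], s_direct))
```

Because the local truncation rank equals the true rank here, nothing is lost in the leaves and the merged result reproduces the direct SVD up to rounding. With a smaller local rank, every level introduces a truncation error, which is why the Heat routines report an error estimate.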


**References for hierarchical SVD**

1. Iwen, Ong. *A distributed and incremental SVD algorithm for agglomerative data analysis on large networks.* SIAM J. Matrix Anal. Appl., **37** (4), 2016.
2. Himpe, Leibner, Rave. *Hierarchical approximate proper orthogonal decomposition.* SIAM J. Sci. Comput., **40** (5), 2018.
3. Halko, Martinsson, Tropp. *Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.* SIAM Rev., **53** (2), 2011.

# QR decomposition

# work in progress