# Loading and Preprocessing

### Refresher

Using PyTorch as compute engine and mpi4py for communication, Heat implements a number of array operations and algorithms that are optimized for memory-distributed data volumes. This allows you to tackle datasets that are too large for single-node (or worse, single-GPU) processing. 

As opposed to task-parallel frameworks, Heat takes a data-parallel approach, meaning that each "worker" or MPI process performs the same tasks on different slices of the data. Many operations and algorithms are not embarrassingly parallel and involve data exchange between processes. Heat operations and algorithms are designed to minimize this communication overhead and to make it transparent to the user.

In other words: 
- you don't have to worry about optimizing data chunk sizes;
- you don't have to make sure your research problem is embarrassingly parallel, or artificially make your dataset smaller so that it fits into your RAM;
- you do have to make sure that you have sufficient **overall** RAM to run your global task (e.g. by requesting enough nodes / GPUs).
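
To make the last point concrete, here is a rough back-of-the-envelope sketch in plain Python (no Heat required). The helper `per_process_megabytes` is hypothetical and not part of Heat. It assumes the array is split evenly, and it ignores temporary copies and communication buffers, so treat the numbers as a lower bound.

```python
import math


def per_process_megabytes(shape, itemsize, n_processes):
    """Approximate memory (in MB) each process holds for an evenly split array.

    Hypothetical helper for illustration only; it ignores temporaries and
    communication buffers.
    """
    total_bytes = math.prod(shape) * itemsize
    return total_bytes / n_processes / 1e6


# Assumed workload: 100 million datapoints with 64 float64 features (8 bytes each).
shape = (100_000_000, 64)
for n in (1, 4, 16, 64):
    print(f"{n:>2} processes: ~{per_process_megabytes(shape, 8, n):,.0f} MB each")
```

The total footprint stays the same regardless of the number of processes; adding processes only shrinks the share each one has to hold.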

The following shows some I/O and preprocessing examples. We'll use small datasets here, as each of us has access to only one node.

### I/O

Let's start by loading a data set. Heat supports reading and writing from/into shared memory for a number of formats, including HDF5, NetCDF, and, because we love scientists, CSV. Check out the `ht.load` and `ht.save` functions for more details. Here we will load data in [HDF5 format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

This particular example data set (generated from all asteroids in the [JPL Small Body Database](https://ssd.jpl.nasa.gov/sb/)) is really small, but it lets us demonstrate the basic functionality of Heat.
 

Your ipcluster should still be running (see the [Intro](0_setup/0_setup_local.ipynb)). Let's test it:

In [None]:
from ipyparallel import Client
rc = Client(profile="default")
rc.ids

[0, 1, 2, 3]

The above cell should return `[0, 1, 2, 3]`, i.e. one ID per running engine.

Now let's import `heat` and load the data set.

In [None]:
%%px
import heat as ht
import sklearn
import sklearn.datasets

X,_ = sklearn.datasets.load_digits(return_X_y=True)
X = ht.array(X, split=0)
X


Out[0:54]: <DNDarray(MPI-rank: 0, Shape: (1797, 64), Split: 0, Local Shape: (450, 64), Device: cpu:0, Dtype: float64)>

Out[2:36]: <DNDarray(MPI-rank: 2, Shape: (1797, 64), Split: 0, Local Shape: (449, 64), Device: cpu:0, Dtype: float64)>

Out[3:36]: <DNDarray(MPI-rank: 3, Shape: (1797, 64), Split: 0, Local Shape: (449, 64), Device: cpu:0, Dtype: float64)>

Out[1:36]: <DNDarray(MPI-rank: 1, Shape: (1797, 64), Split: 0, Local Shape: (449, 64), Device: cpu:0, Dtype: float64)>

The data is now distributed over 4 MPI processes. Because we created `X` with `split=0`, each process stores a roughly equal slice of the data along dimension 0 (compare the `Local Shape` entries above: 450 rows on rank 0, 449 on the others).
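
The local shapes reported above can be reproduced with a few lines of plain Python. This is only a sketch: it assumes that leftover rows are handed to the lowest ranks, which is consistent with the output shown here but is not taken from Heat's source. `local_row_counts` is a hypothetical helper, not a Heat function.

```python
def local_row_counts(n_rows, n_processes):
    """Rows per process, assuming remainder rows go to the lowest ranks."""
    base, remainder = divmod(n_rows, n_processes)
    return [base + 1 if rank < remainder else base for rank in range(n_processes)]


counts = local_row_counts(1797, 4)
print(counts)       # [450, 449, 449, 449]
print(sum(counts))  # 1797
```

Under this assumption, the local row counts differ by at most one, so the load stays balanced even when the number of datapoints is not divisible by the number of processes.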

### Data exploration

Let's get an idea of the size of the data.

In [None]:
%%px 
# print global metadata once only
if X.comm.rank == 0:
    print(f"X is a {X.ndim}-dimensional array with shape {X.shape}")
    print(f"X takes up {X.nbytes/1e6} MB of memory.")

[stdout:0] X is a 2-dimensional array with shape (1797, 64)
X takes up 0.920064 MB of memory.


X is a matrix of shape *(datapoints, features)*. 

To get a first overview, we can print the data and determine its feature-wise mean, variance, min, max etc. These are reduction operations along the datapoints dimension, which is also the `split` dimension. You don't have to implement [`MPI.Allreduce`](https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/) operations yourself; Heat handles the communication for you.

In [None]:
%%px
features_mean = ht.mean(X,axis=0)
features_var = ht.var(X,axis=0)
features_max = ht.max(X,axis=0)
features_min = ht.min(X,axis=0)

if ht.MPI_WORLD.rank == 0:
    print(f"Mean: {features_mean}")
    print(f"Var: {features_var}")
    print(f"Max: {features_max}")
    print(f"Min: {features_min}")

[stdout:0] Mean: DNDarray([0.0000e+00, 3.0384e-01, 5.2048e+00, 1.1836e+01, 1.1848e+01, 5.7819e+00, 1.3623e+00, 1.2966e-01, 5.5648e-03,
          1.9939e+00, 1.0382e+01, 1.1979e+01, 1.0279e+01, 8.1758e+00, 1.8464e+00, 1.0796e-01, 2.7824e-03, 2.6016e+00,
          9.9032e+00, 6.9928e+00, 7.0979e+00, 7.8063e+00, 1.7885e+00, 5.0083e-02, 1.1130e-03, 2.4697e+00, 9.0913e+00,
          8.8214e+00, 9.9271e+00, 7.5515e+00, 2.3178e+00, 2.2259e-03, 0.0000e+00, 2.3395e+00, 7.6672e+00, 9.0718e+00,
          1.0302e+01, 8.7440e+00, 2.9093e+00, 0.0000e+00, 8.9037e-03, 1.5838e+00, 6.8815e+00, 7.2282e+00, 7.6722e+00,
          8.2365e+00, 3.4563e+00, 2.7268e-02, 7.2343e-03, 7.0451e-01, 7.5070e+00, 9.5392e+00, 9.4162e+00, 8.7585e+00,
          3.7251e+00, 2.0646e-01, 5.5648e-04, 2.7935e-01, 5.5576e+00, 1.2089e+01, 1.1809e+01, 6.7641e+00, 2.0679e+00,
          3.6450e-01], dtype=ht.float32, device=cpu:0, split=None)
Var: DNDarray([0.0000e+00, 8.2254e-01, 2.2596e+01, 1.8043e+01, 1.8371e+01, 3.2090e+01, 1.1

Note that the `features_...` DNDarrays are no longer distributed (`split=None`), i.e. a copy of these results exists on each process, as the split dimension of the input data has been lost in the reduction operations.
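
To illustrate why the result of such a reduction ends up on every process, here is a plain-Python simulation of a feature-wise mean over row-distributed data. It is a conceptual sketch, not Heat's implementation: the "processes" are just list chunks, and `split_rows`, `local_partials` and `allreduce_mean` are hypothetical helpers that stand in for the local computation and a sum-Allreduce.

```python
def split_rows(rows, n_processes):
    """Cut a list of rows into contiguous, nearly equal chunks (one per process)."""
    base, remainder = divmod(len(rows), n_processes)
    chunks, start = [], 0
    for rank in range(n_processes):
        stop = start + base + (1 if rank < remainder else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks


def local_partials(chunk, n_features):
    """What one process computes on its own slice: per-feature sums and a row count."""
    sums = [0.0] * n_features
    for row in chunk:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, len(chunk)


def allreduce_mean(partials):
    """Stand-in for a sum-Allreduce: combine all partial results into the global mean."""
    n_features = len(partials[0][0])
    total_rows = sum(count for _, count in partials)
    total_sums = [sum(sums[j] for sums, _ in partials) for j in range(n_features)]
    return [s / total_rows for s in total_sums]


# Toy data: 10 datapoints, 2 features, distributed over 4 simulated processes.
data = [[float(i), float(2 * i)] for i in range(10)]
partials = [local_partials(chunk, 2) for chunk in split_rows(data, 4)]
feature_means = allreduce_mean(partials)
print(feature_means)  # [4.5, 9.0]
```

Each process only touches its own rows; what gets exchanged is one small vector of partial sums plus a count. After the exchange, every process holds the same small result, which is why there is no split dimension left.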

### Preprocessing/scaling

Next, we can preprocess the data, e.g., by standardizing and/or normalizing. Heat offers several preprocessing routines for doing so. The API is similar to [`sklearn.preprocessing`](https://scikit-learn.org/stable/modules/preprocessing.html), so adapting existing code shouldn't be too complicated.

Again, please let us know if you're missing any features.
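
As a reminder of what standardization does to each feature, here is a minimal plain-Python sketch. It is not Heat's implementation. The threshold `eps` and the choice to only center (near-)constant features instead of dividing by a vanishing standard deviation are assumptions made for this illustration.

```python
import math


def standardize(rows, eps=1e-12):
    """Scale each feature to zero mean and unit variance.

    Features whose variance is below `eps` (an assumed threshold) are only
    centered, because there is no spread to normalize.
    """
    n_rows, n_features = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in rows) / n_rows
        for j in range(n_features)
    ]
    scales = [math.sqrt(v) if v > eps else 1.0 for v in variances]
    return [
        [(row[j] - means[j]) / scales[j] for j in range(n_features)]
        for row in rows
    ]


# Toy data: the second feature is constant.
data = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled = standardize(data)
for row in scaled:
    print(row)
```

After scaling, the first feature has zero mean and unit variance, while the constant second feature is mapped to zeros without any division by a (near-)zero standard deviation.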

In [None]:
%%px
# Standard Scaler
scaler = ht.preprocessing.StandardScaler()
X_standardized = scaler.fit_transform(X)
standardized_mean = ht.mean(X_standardized,axis=0)
standardized_var = ht.var(X_standardized,axis=0)
print(f"Standard Scaler Mean: {standardized_mean}")
print(f"Standard Scaler Var: {standardized_var}")

# Robust Scaler
scaler = ht.preprocessing.RobustScaler()
X_robust = scaler.fit_transform(X)
robust_mean = ht.mean(X_robust,axis=0)
robust_var = ht.var(X_robust,axis=0)

print(f"Robust Scaler Mean: {robust_mean}")
print(f"Robust Scaler Var: {robust_var}")

[stdout:1] At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Standard Scaler Mean: 
Standard Scaler Var: 
At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Robust Scaler Mean: 
Robust Scaler Var: 


[stdout:2] At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Standard Scaler Mean: 
Standard Scaler Var: 
At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Robust Scaler Mean: 
Robust Scaler Var: 


[stdout:3] At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Standard Scaler Mean: 
Standard Scaler Var: 
At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Robust Scaler Mean: 
Robust Scaler Var: 


[stdout:0] At least one of the features is almost constant (w.r.t. machine precision) and will not be scaled for this reason.
Standard Scaler Mean: DNDarray([ 0.0000e+00, -1.0710e-08, -1.1292e-08,  3.9116e-08,  1.0431e-07, -4.6566e-08, -7.4506e-09, -1.8626e-09,
           0.0000e+00, -1.0710e-08, -6.3796e-08, -1.1176e-08, -1.1502e-07, -5.9605e-08, -2.2352e-08,  1.8626e-09,
          -2.7940e-09, -1.6764e-08, -9.5344e-08,  5.5879e-08,  1.3970e-08,  5.5181e-08, -2.9802e-08, -7.4506e-09,
           9.3132e-10, -1.6764e-08,  6.4261e-08, -3.9116e-08, -6.7055e-08, -6.2399e-08, -2.1420e-08, -1.8626e-09,
           0.0000e+00,  0.0000e+00,  2.9802e-08, -9.0338e-08, -1.3970e-09,  3.5390e-08,  2.6077e-08,  0.0000e+00,
          -1.8626e-09,  3.1199e-08, -2.3749e-08, -6.7055e-08, -2.8871e-08, -4.0978e-08, -3.0384e-08, -5.3551e-09,
          -2.7940e-09, -7.4506e-09, -1.0245e-08, -3.7253e-08, -3.7253e-09, -6.3330e-08, -1.8626e-09,  3.7253e-09,
          -2.7940e-09, -3.7253e-09, -4.0513e-08,  7.62

Within Heat, you have several options for applying memory-distributed machine learning algorithms to your data. Check out our dedicated "clustering" notebook for an example.



Is the algorithm you're looking for not yet implemented? [Let us know](https://github.com/helmholtz-analytics/heat/issues/new/choose)! 