# Dask arrays

Let's talk about Dask's numpy-like interface.

Let's say that we'd like to compute the temporal signal-to-noise ratio (tsnr) over several runs of fMRI data, for example, using the OpenfMRI dataset ds000114. We've already downloaded a session's worth of data to the hub, in this location:

In [None]:
from glob import glob

fnames = glob('/home/jovyan/shared/ds000114/sub-01/ses-test/func/*.nii.gz')

In [None]:
fnames

This is a list with 5 different file names, one for each run in this session. One way to calculate the tsnr across runs is to loop over the files, read in the data for each one of them, concatenate the data, and then compute the tsnr from the concatenated series. Note that we are using the illustrious nibabel library to read the data from disk. Nibabel uses "lazy loading": the data are not read from file when the nibabel `load` function is called on a file name. Instead, nibabel waits until we ask for the data, using the `get_fdata` method of the `Nifti1Image` class. Here, we do that immediately, but we'll use this fact to our advantage a little later.

In [None]:
%%time
import numpy as np
import nibabel as nib
data = []
for fname in fnames:
    data.append(nib.load(fname).get_fdata())

data = np.concatenate(data, -1)
tsnr = data.mean(-1) / data.std(-1)
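
To make the tsnr formula concrete, here is a minimal sketch on synthetic data. The array shape and the time series values are made up for illustration; only the last line mirrors the computation above.

```python
import numpy as np

# Synthetic stand-in for fMRI data: 2 x 2 x 2 voxels, 4 time points each.
# Every voxel gets the same made-up time series [1, 2, 3, 4].
toy_data = np.broadcast_to(np.array([1.0, 2.0, 3.0, 4.0]), (2, 2, 2, 4))

# tsnr: mean over time divided by standard deviation over time (last axis)
toy_tsnr = toy_data.mean(-1) / toy_data.std(-1)

print(toy_tsnr.shape)  # (2, 2, 2)
```

The time axis is reduced away, so the result has one value per voxel.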

When we do that, most of the time is spent on reading the data from file.
As you can probably reason yourself, the individual items in the data
list have no dependency on each other, so they could be calculated in
parallel.

How do we approach the problem in this case? Because of nibabel's lazy loading, we can hold off on the call to `get_fdata`. We create a delayed function that we call `delayed_get_fdata`.


In [None]:
from dask import delayed
delayed_get_fdata = delayed(nib.Nifti1Image.get_fdata)
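
If `delayed` is new to you, this small sketch shows what wrapping a function does. The function `slow_square` is a hypothetical stand-in for an expensive call and is not part of the example above.

```python
from dask import delayed


def slow_square(x):
    """Hypothetical stand-in for an expensive call, such as reading a file."""
    return x * x


delayed_square = delayed(slow_square)

# Calling the wrapped function records the call instead of running it
lazy_value = delayed_square(4)
print(type(lazy_value).__name__)  # Delayed

# The work only happens when we ask for the result
print(lazy_value.compute())  # 16
```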

Then, we use this function to create a list of items and delay each one of the computations on this list:

In [None]:
data = []
for fname in fnames:
    data.append(delayed_get_fdata(nib.load(fname)))

data = delayed(np.concatenate)(data, -1)
tsnr = delayed(data.mean)(-1) / delayed(data.std)(-1)
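
To check that a delayed pipeline like this gives the same answer as the eager one, here is a sketch that runs both on synthetic data. The `load_run` function and the random "runs" are assumptions made for illustration; they stand in for reading files with nibabel. Method calls on a delayed object are themselves delayed, so the sketch calls `mean` and `std` on it directly.

```python
import numpy as np
from dask import delayed

rng = np.random.default_rng(0)
# Three made-up "runs", each 4 x 4 x 4 voxels with 10 time points
toy_runs = [rng.normal(100, 5, size=(4, 4, 4, 10)) for _ in range(3)]


def load_run(i):
    """Hypothetical loader standing in for nib.load(fname).get_fdata()."""
    return toy_runs[i]


# Build the graph: nothing is "loaded" or computed yet
lazy_runs = [delayed(load_run)(i) for i in range(len(toy_runs))]
lazy_data = delayed(np.concatenate)(lazy_runs, -1)
lazy_tsnr = lazy_data.mean(-1) / lazy_data.std(-1)

# Eager NumPy version for comparison
eager_data = np.concatenate(toy_runs, -1)
eager_tsnr = eager_data.mean(-1) / eager_data.std(-1)

print(np.allclose(lazy_tsnr.compute(), eager_tsnr))  # True
```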

We can see what the graph is for this computation:

In [None]:
tsnr.visualize()

Indeed, computing tsnr this way can give you an approximately 2-fold speedup. This is because Dask allows the Python process to read several of the files in parallel, and reading the files is the performance bottleneck here.

In [None]:
%%time
result = tsnr.compute()
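
The effect of running independent tasks in parallel can be sketched without any data files. Here `fake_read` is a hypothetical task that just sleeps, standing in for time spent waiting on the disk, and the choice of 4 tasks and 4 workers is arbitrary. The timing you see will depend on your machine.

```python
import time

import dask
from dask import delayed


def fake_read(i):
    """Hypothetical I/O-bound task: sleeps instead of reading a file."""
    time.sleep(0.2)
    return i


tasks = [delayed(fake_read)(i) for i in range(4)]

start = time.perf_counter()
# The threaded scheduler can run these independent tasks concurrently
results = dask.compute(*tasks, scheduler="threads", num_workers=4)
elapsed = time.perf_counter() - start

print(results)  # (0, 1, 2, 3)
print(f"{elapsed:.2f} s, versus about 0.8 s if run one after another")
```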

### Dask arrays

This is already quite useful, but wouldn't you rather just tell dask that you are going to create some data and to treat it all as delayed until you are ready to compute the tsnr?

This idea is implemented in the dask array interface. The idea here is that you create something that provides all of the interfaces of a numpy array, but all the computations are treated as delayed.

This is what it would look like for the tsnr example. Instead of appending delayed calls to `get_fdata` to a list, we create a series of dask arrays with `delayed_get_fdata`. We do need to tell dask both the shape and the data type of the arrays that will eventually be read, but both can be determined up front, without reading the data itself.

In [None]:
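import dask.array as da

delayed_arrays = []
for fname in fnames:
    img = nib.load(fname)  # lazy: the image data is not read here
    delayed_arrays.append(
        da.from_delayed(
            delayed_get_fdata(img),
            shape=img.shape,
            # get_fdata returns floating-point data (float64 by default),
            # which may differ from the on-disk type in img.get_data_dtype()
            dtype=np.float64,
        )
    )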

In [None]:
delayed_arrays

These are notional arrays that have not been materialized yet. The data has not been read from disk into memory yet, although dask already knows where it will put each piece once it is read.
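
The same behavior can be seen in a self-contained sketch that needs no files. The `make_block` function is a hypothetical loader, and the shape and values are made up for illustration.

```python
import numpy as np
import dask.array as da
from dask import delayed


def make_block():
    """Hypothetical loader; stands in for reading one run from disk."""
    return np.arange(24, dtype=np.float64).reshape(2, 3, 4)


# We state the shape and dtype ourselves; make_block has not been called
lazy_block = da.from_delayed(
    delayed(make_block)(), shape=(2, 3, 4), dtype=np.float64
)

print(lazy_block.shape)  # (2, 3, 4)
print(lazy_block.dtype)  # float64

# Only compute() calls make_block and returns a real NumPy array
block = lazy_block.compute()
print(type(block).__name__)  # ndarray
```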

But they have a lot of properties that real arrays have, albeit with the dask array interface, instead of the one from numpy. For example, we can use the `dask.array.concatenate` function with them:

In [None]:
arr = da.concatenate(delayed_arrays, -1)

This array has some attributes that look just like numpy array attributes:

In [None]:
arr.shape

On the other hand, when we ask to see it, we get a lot of information, but none of the data:

In [None]:
arr

Nevertheless, we can then use methods of the `dask.array` object to complete the computation. The computation looks just like the computation we did with the numpy array!

In [None]:
tsnr = arr.mean(-1) / arr.std(-1)

In [None]:
tsnr

In [None]:
tsnr.visualize()

This looks really complicated, but notice that because dask has even more insight into what we are trying to do, it can delay some things until after aggregation. For example, the square root computation of the standard deviation can be done once at the end, instead of on each array separately.
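
The idea of deferring the square root until after aggregation can be sketched in plain numpy. This is a simplified illustration of the principle, not dask's actual implementation, and the chunk sizes and random data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up chunks of one long series, standing in for separate runs
chunks = [rng.normal(size=n) for n in (50, 80, 70)]

# Per chunk, keep only three small numbers: count, sum, sum of squares
partials = [(c.size, c.sum(), (c ** 2).sum()) for c in chunks]

# Aggregate the partial results across chunks
n = sum(p[0] for p in partials)
total = sum(p[1] for p in partials)
total_sq = sum(p[2] for p in partials)

# A single square root at the very end
chunked_std = np.sqrt(total_sq / n - (total / n) ** 2)

print(np.isclose(chunked_std, np.concatenate(chunks).std()))  # True
```

Each chunk can be summarized independently, which is what lets the per-chunk work run in parallel.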

And this leads to an *additional* approximately 2-fold speedup.

One of the main things to notice about the dask array is that, because the data is not read into memory until it is needed, it can represent very large datasets and schedule operations over them in a manner that makes the code look as though all the data were in memory.

In [None]:
%%time
result = tsnr.compute()
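
As a rough sketch of the point about large datasets, here is an array that is described in chunks and reduced chunk by chunk. The shape and chunk sizes are arbitrary choices for illustration, and how much memory is used at any one time depends on the scheduler and the number of workers.

```python
import dask.array as da

# A nominally 1.6 GB array, described as 200 chunks of 8 MB each
big = da.ones((20_000, 10_000), chunks=(1_000, 1_000))

print(big.nbytes)       # 1600000000
print(big.npartitions)  # 200

# The reduction is carried out chunk by chunk
total = big.sum().compute()
print(total)  # 200000000.0
```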