# Dask Array

This notebook gives a quick demo of using `dask.array`. It is not intended to be a full tutorial on using dask, or a full demonstration of its capabilities. For more information see the docs [here](http://dask.pydata.org/en/latest/), or the tutorial [here](https://github.com/dask/dask-tutorial).

## Create an array

In [None]:
# create data if it doesn't already exist
from utils import random_array
random_array()

Load the data with `h5py`:

In [None]:
import h5py
dset = h5py.File('data/random.hdf5', mode='r')['/x']
dset

Note that none of the data is in memory yet; the `dset` object just points to data that's on disk.

### Create a dask array
Dask arrays can be created in a few different ways, but here we'll use the `from_array` function.

`from_array` takes the following parameters:

- `data`: Any object that supports NumPy slicing. Here we'll use an HDF5 dataset.
- `chunks`: A chunk size to tell us how to block up our array, like `(2500, 2500)`. A single integer such as `2500` applies the same size along every dimension.
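
As a small illustration of how `chunks` blocks up an array, here is a sketch that uses an in-memory NumPy array as a stand-in for the HDF5 dataset. The names `data` and `small` are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for the HDF5 dataset: a small 6x6 NumPy array.
data = np.arange(36).reshape(6, 6)

# A single integer chunk size is applied along every dimension.
small = da.from_array(data, chunks=3)

print(small.chunks)     # block sizes along each axis
print(small.numblocks)  # number of blocks along each axis
```

A 6x6 array with `chunks=3` is split into a 2x2 grid of 3x3 blocks.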

In [None]:
import dask.array as da
x = da.from_array(dset, chunks=2500)
x

### Examine the array
`x` is now a `dask.array.Array` object, which looks a lot like a NumPy array. It even has many of the same methods:

In [None]:
# Pretty print out the Array methods
import textwrap
print('\n'.join(textwrap.wrap(str([f for f in dir(x) if not f.startswith('_')]))))

In [None]:
x.shape

In [None]:
x.dtype

In [None]:
x.nbytes / 1e9

Computations done on the array are not run immediately, but recorded in a graph as a `dask` attribute on the array. This is common for all dask collections (`dask.array`, `dask.dataframe`, `dask.bag` and `dask.delayed`). To see the graph, one can use the `visualize` method:
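
The lazy behaviour can also be seen without drawing the graph. The following is a minimal sketch on a small in-memory array (the names `small` and `lazy` are made up for this example): the operations return another dask array, and the numbers are only produced once `compute` is called.

```python
import numpy as np
import dask.array as da

# Hypothetical small array, split into 2x2 blocks.
small = da.from_array(np.ones((4, 4)), chunks=2)

# Nothing is computed here; the operations are only recorded.
lazy = (small + 1).sum()

print(type(lazy))      # still a dask array
print(len(lazy.dask))  # number of tasks recorded in the graph
print(lazy.compute())  # runs the graph and returns the value
```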

In [None]:
x.visualize()

The graph of `x` just shows several slices being taken out of `dset`, one for each chunk. Let's create some larger computations and visualize them:

### Add `x` with its transpose

In [None]:
expr = x + x.T
expr.visualize()

### Then sum along an axis

In [None]:
expr = (x + x.T).sum(axis=0)
expr.visualize()

### Then matrix multiply with `x`

In [None]:
expr = (x + x.T).sum(axis=0).dot(x)
expr.visualize()
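
As a sanity check, the same chain of operations can be run on a small in-memory array and compared with plain NumPy. This is only a sketch; `data`, `small`, and `small_expr` are names made up for this example and do not touch the notebook's `x` or `expr`.

```python
import numpy as np
import dask.array as da

# Hypothetical small square array, split into 2x2 blocks.
data = np.arange(16.0).reshape(4, 4)
small = da.from_array(data, chunks=2)

# Same expression as above, built lazily with dask.
small_expr = (small + small.T).sum(axis=0).dot(small)

# Reference result computed eagerly with NumPy.
expected = (data + data.T).sum(axis=0).dot(data)

print(np.allclose(small_expr.compute(), expected))
```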

### Dask implements some of `np.linalg`:

Note that it won't beat highly optimized parallel linear algebra libraries like [elemental](http://libelemental.org/). The flexibility can be useful though.

It also makes really pretty graphs:

In [None]:
da.linalg.inv(x).visualize()

That's quite a complicated graph!
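
To see that the blocked inverse agrees with the definition, here is a minimal sketch on a small, well-conditioned matrix. It assumes SciPy is installed (dask's linear algebra routines rely on it) and uses square chunks; `data`, `small`, and `small_inv` are names made up for this example.

```python
import numpy as np
import dask.array as da

# Hypothetical well-conditioned 4x4 matrix, split into square 2x2 chunks.
data = 4 * np.eye(4) + np.ones((4, 4))
small = da.from_array(data, chunks=2)

# Build the blocked inverse lazily, then run it.
small_inv = da.linalg.inv(small).compute()

# The inverse times the original matrix should be close to the identity.
print(np.allclose(small_inv @ data, np.eye(4)))
```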

### Graph optimizations

Before the computation is actually run (by calling the `compute` method on the array), the graph is simplified with several optimization passes to improve efficiency. To visualize the final graph to be run, we can set the `optimize_graph` keyword of `visualize` to `True`.

For clarity, let's go back to a simpler computation:

In [None]:
expr = (x + x.T).sum(axis=0)
expr.visualize()

In [None]:
expr.visualize(optimize_graph=True)

This graph is much simpler; it is the result of inlining, fusing, and rewriting tasks.
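
The effect of optimization can also be inspected numerically by counting tasks before and after. The sketch below uses `dask.optimize` on a small in-memory array; the names are made up for this example, and the exact counts depend on the dask version, so no specific numbers are claimed here.

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical small array, split into 2x2 blocks.
small = da.from_array(np.ones((4, 4)), chunks=2)
small_expr = (small + small.T).sum(axis=0)

# dask.optimize returns a tuple with one optimized collection per input.
(optimized,) = dask.optimize(small_expr)

tasks_before = len(dict(small_expr.__dask_graph__()))
tasks_after = len(dict(optimized.__dask_graph__()))
print(tasks_before, tasks_after)

# The optimized graph still computes the same result.
print(np.allclose(optimized.compute(), small_expr.compute()))
```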

### Perform the actual computation
To actually run the computation, we can call the `compute` method. 

We'll also load some diagnostic tools so we can profile the computation afterwards. To read more about the diagnostic options in dask, see the docs [here](http://dask.pydata.org/en/latest/diagnostics.html).

In [None]:
from dask.diagnostics import Profiler, ResourceProfiler, ProgressBar, visualize
from bokeh.io import output_notebook
ProgressBar().register()
output_notebook()

In [None]:
with ResourceProfiler(dt=0.5) as rprof, Profiler() as prof:
    out = expr.compute(num_workers=4)

Here we'll plot a profile of the computation as executed by dask. The top plot has a box for each task showing the duration of the task (hover over the box with your mouse to see the task description). The bottom plot shows our resource usage.

As can be seen in this plot, we used 400% CPU, indicating that all 4 cores of my computer were fully used.
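
The profiler's raw records can also be read directly, without plotting. The following is a minimal sketch on a small in-memory array that uses only `Profiler` (so neither `bokeh` nor the resource profiler is needed); `small`, `small_expr`, and `result` are names made up for this example.

```python
import numpy as np
import dask.array as da
from dask.diagnostics import Profiler

# Hypothetical small array, split into 2x2 blocks.
small = da.from_array(np.ones((4, 4)), chunks=2)
small_expr = (small + small.T).sum(axis=0)

# Record one entry per executed task on the threaded scheduler.
with Profiler() as prof:
    result = small_expr.compute(scheduler="threads", num_workers=2)

# Each record holds the task key and its start and end times.
for record in prof.results[:3]:
    print(record.key, record.end_time - record.start_time)
```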

In [None]:
visualize([prof, rprof]);

Note that the max memory used was `~600 MB`, while the whole dataset was `1.25 GB`. This is because the blocked algorithms implemented in dask allow the computations to be done out-of-core, so only some of the chunks need to be in memory at any one time. This allows you to work with data that exceeds the amount of RAM on your machine.
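
To get a feel for why memory use stays low, the size of a single chunk can be estimated by hand. This sketch assumes a 2-D array of `float64` values chunked with `chunks=2500`; both the dimensionality and the dtype are assumptions, so check `x.chunks` and `x.dtype` for the real values.

```python
import numpy as np

# Assumption: each chunk is 2500 x 2500 elements.
chunk_shape = (2500, 2500)

# Assumption: elements are 8-byte floats.
itemsize = np.dtype("float64").itemsize

chunk_mb = chunk_shape[0] * chunk_shape[1] * itemsize / 1e6
print(chunk_mb)  # 50.0
```

Under these assumptions each chunk is about 50 MB, so a handful of chunks and intermediate results fit in memory at once even when the full array does not.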