<img src="img/dask_icon.svg" width="25%" align="right">

# Parallel, larger-than-memory objects with Dask


Dask provides parallel, larger-than-memory dataframes and n-dimensional arrays built on blocked algorithms.

*  **Parallel**: Uses all of the cores on your computer
*  **Larger-than-memory**:  Lets you work on datasets that are larger than your available memory by breaking up your dataset into many small pieces, operating on those pieces in an order that minimizes the memory footprint of your computation, and effectively streaming data from disk.
*  **Blocked Algorithms**:  Perform large computations by performing many smaller computations

**Related Documentation**

* http://dask.readthedocs.io/en/latest/array.html
* http://dask.readthedocs.io/en/latest/dataframe.html

## Blocked Algorithms

A *blocked algorithm* executes on a large dataset by breaking it up into many small blocks.

For example, consider taking the sum of a billion numbers.  Rather than summing the whole array at once, we might break it up into 1,000 chunks, each of size 1,000,000, take the sum of each chunk, and then take the sum of the intermediate sums.

We achieve the intended result (one sum on one billion numbers) by performing many smaller computations (one thousand sums on one million numbers each, followed by another sum of a thousand numbers).
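Before turning to the on-disk data, the idea can be tried on a scaled-down array that fits in memory. This is only a sketch: the array of ones and the `block_size` of 1,000 are illustrative assumptions, not part of the exercise data.

```python
import numpy as np

# Small stand-in for the billion-element array: 1,000,000 ones.
data = np.ones(1_000_000, dtype="float64")
block_size = 1_000  # hypothetical block size, chosen only for illustration

# Sum each block separately, then sum the intermediate results.
partial_sums = []
for start in range(0, len(data), block_size):
    block = data[start:start + block_size]
    partial_sums.append(block.sum())

total = sum(partial_sums)
print(len(partial_sums), total)
```

Only one block needs to be held in memory at a time, which is what makes the same pattern usable when the data lives on disk.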

We do exactly this with Python and NumPy in the following example:

#### Create random dataset

In [1]:
import prep

In [2]:
import os
os.chdir('/local')

In [3]:
# create data if it doesn't already exist
prep.random_array()  

Create random data for array exercise


In [5]:
# Load data with h5py
import h5py
f = h5py.File('random.hdf5', mode='r')  # open the file read-only
dset = f['/x']

#### Compute sum using blocked algorithm

Here we compute the sum of this large array on disk by 

1.  Computing the sum of each 1,000,000-element chunk of the array
2.  Computing the sum of the 1,000 intermediate sums

In [6]:
dset

<HDF5 dataset "x": shape (1000000000,), type "<f4">

In [7]:
%%time

# Compute sum of large array, one million numbers at a time
sums = []
for i in range(0, 1000000000, 1000000):
    chunk = dset[i: i + 1000000]  # pull out numpy array
    sums.append(chunk.sum())

total = sum(sums)
print(total)

999993482.375
CPU times: user 596 ms, sys: 928 ms, total: 1.52 s
Wall time: 2.17 s


#### Exercise:  Compute the mean using a blocked algorithm

Now that we've seen the simple example above, try a slightly more complicated problem: compute the mean of the array.  You can do this by making the following alterations to the code above:

1.  Compute the sum of each block
2.  Compute the length of each block
3.  Compute the sum of the 1,000 intermediate sums and the sum of the 1,000 intermediate lengths and divide one by the other

This approach is overkill for our case, but it generalizes nicely to cases where we don't know the size of the array or of the individual blocks beforehand.

In [8]:
# Compute the mean of the array

In [9]:
sums = []
lengths = []
for i in range(0, 1000000000, 1000000):
    chunk = dset[i: i + 1000000]  # pull out numpy array
    sums.append(chunk.sum())
    lengths.append(len(chunk))

mean = sum(sums) / sum(lengths)
print(mean)

0.999993482375
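The generalization to blocks of unknown size can be sketched as a small function that keeps a running sum and a running count. The helper name `blocked_mean` and the uneven split points below are hypothetical, chosen only to illustrate the pattern on in-memory data.

```python
import numpy as np

def blocked_mean(blocks):
    """Mean of all values across an iterable of 1-D NumPy arrays."""
    total = 0.0
    count = 0
    for block in blocks:
        total += float(block.sum())  # running sum of block sums
        count += len(block)          # running sum of block lengths
    return total / count

rng = np.random.default_rng(0)
values = rng.random(10_000)

# Split into four blocks of different sizes (100, 2400, 4500, 3000).
uneven_blocks = np.split(values, [100, 2_500, 7_000])

print(blocked_mean(uneven_blocks), values.mean())
```

Because the function tracks lengths itself, it never needs to know the total size or the block sizes in advance.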


## `dask.array` contains these algorithms

Dask.array is a NumPy-like library that does these kinds of tricks to operate on large datasets that don't fit into memory.  It extends beyond the one-dimensional problems discussed above to full N-dimensional algorithms and a decent subset of the NumPy interface.

#### Create `dask.array` object

You can create a `dask.array` `Array` object with the `da.from_array` function.  This function accepts

1.  `data`: Any object that supports NumPy slicing, like `dset`
2.  `chunks`: A chunk size to tell us how to block up our array, like `(1000000,)`

In [10]:
import dask.array as da
x = da.from_array(dset, chunks=(1000000,))

In [11]:
x

dask.array<array-1..., shape=(1000000000,), dtype=float32, chunksize=(1000000,)>
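To see how `chunks` divides an array, here is a minimal sketch that uses a small in-memory NumPy array as a stand-in for `dset`. The name `x_small` and the sizes are assumptions made for illustration.

```python
import numpy as np
import dask.array as da

small = np.arange(1_000, dtype="float32")      # in-memory stand-in for `dset`
x_small = da.from_array(small, chunks=(250,))  # blocks of 250 elements

print(x_small.chunks)     # block sizes along each dimension
print(x_small.numblocks)  # number of blocks along each dimension
```

With 1,000 elements and a chunk size of 250, the array is divided into four equal blocks.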

#### Manipulate `dask.array` object as you would a numpy array

Now that we have an `Array`, we can perform standard numpy-style computations like arithmetic, mathematics, slicing, reductions, etc.

The interface is familiar, but the actual work is different. `dask_array.sum()` does not do the same thing as `numpy_array.sum()`.

#### What's the difference?

`dask_array.sum()` builds an expression of the computation. It does not do the computation yet. `numpy_array.sum()` computes the sum immediately.

#### Why the difference?

Dask arrays are split into chunks, and every operation has to be run on each chunk it touches. Building the expression first lets Dask see the whole computation before doing any work. If the desired answer depends on only a small slice of the dataset, running the computation over all of the data would waste CPU and memory.

In [12]:
result = x.sum()
result

dask.array<sum-agg..., shape=(), dtype=float32, chunksize=()>

#### Compute result

Dask.array objects are lazily evaluated.  Operations like `.sum` build up a graph of blocked tasks to execute.  

We ask for the final result with a call to `.compute()`.  This triggers the actual computation.

In [19]:
%%time
result.compute()

CPU times: user 858 ms, sys: 1.51 s, total: 2.37 s
Wall time: 2.19 s


9.9999347e+08
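The lazy behaviour can also be checked on a small in-memory array. This sketch assumes nothing about the HDF5 file; `arr` and `lazy_total` are illustrative names.

```python
import numpy as np
import dask.array as da

arr = np.arange(1_000, dtype="float64")
lazy_total = da.from_array(arr, chunks=(100,)).sum()

print(type(lazy_total))       # still a dask Array, not a number
value = lazy_total.compute()  # runs the blocked computation
print(value)
```

Until `.compute()` is called, `lazy_total` only describes the work to be done.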

#### Exercise:  Compute the mean

Also compute the variance, standard deviation, etc.  This should be a trivial change to the example above.

Look at what other operations you can do with the Jupyter notebook's tab-completion.

In [20]:
x.mean().compute()

0.99999344

Does this match your result from before?
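One way to build confidence in these reductions is to compare them with NumPy on data small enough to fit in memory. The following is a sketch under that assumption; the random `sample` array is made up for illustration and is unrelated to `random.hdf5`.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(42)
sample = rng.random(100_000)
d = da.from_array(sample, chunks=(10_000,))

# Pair each dask result with the corresponding NumPy result.
stats = {
    "mean": (d.mean().compute(), sample.mean()),
    "var": (d.var().compute(), sample.var()),
    "std": (d.std().compute(), sample.std()),
}
for name, (dask_value, numpy_value) in stats.items():
    print(name, dask_value, numpy_value, np.isclose(dask_value, numpy_value))
```

The values are compared with `np.isclose` rather than `==` because a blocked reduction adds numbers in a different order, which can change the last few digits of a floating-point result.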

## Operating on a set of csv files

Another typical case is a dataset stored as a series of CSV files on which you want to perform operations. You can do this in a `for` loop (or by concatenating the CSV files into one dataframe, if it fits in memory), but this work can often be parallelized.

#### Prepare accounts data

In [3]:
prep.accounts_csvs(3, 10000000, 500)

Create CSV accounts for dataframe exercise


#### Total number of rows

There are three CSV files in your `data` directory. We want to count the total number of rows across all of them.  In plain Python with pandas, we can solve this problem in the following way.

In [4]:
import pandas as pd

import os
filenames = ['accounts.%d.csv' % i for i in [0, 1, 2]]
filenames

['accounts.0.csv', 'accounts.1.csv', 'accounts.2.csv']

In [5]:
pd.read_csv(filenames[0], nrows=5)  # a sample of the first file

Unnamed: 0,id,names,amount
0,95,Frank,1656
1,323,Sarah,568
2,476,Frank,961
3,69,Ingrid,1508
4,488,Laura,23


In [6]:
%%time 

lengths = []

for fn in filenames:
    df = pd.read_csv(fn)
    lengths.append(len(df))
    
total = sum(lengths)
print(total)

30000000
CPU times: user 7.13 s, sys: 1.44 s, total: 8.57 s
Wall time: 8.9 s


## `dask.dataframe` to hold set of csv files

The `dask.dataframe` module implements a blocked parallel `DataFrame` object that mimics a subset of the Pandas `DataFrame`. One dask `DataFrame` is composed of several pandas `DataFrame`s separated along the index. One operation on a dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.

#### Create `dask.dataframe`

We can use dask's `read_csv`. This works just like `pandas.read_csv`, except on multiple csv files at once.

In [7]:
import dask.dataframe as dd

In [8]:
df = dd.read_csv("accounts.*.csv")

In [9]:
df

dd.DataFrame<from-de..., npartitions=9>

#### Compute the total number of rows

In [10]:
%%time
len(df)

CPU times: user 7.67 s, sys: 5.79 s, total: 13.5 s
Wall time: 4.52 s


30000000

## Conclusions

In these two examples we saw how to replace `for` loops that walk through a dataset one block at a time with convenient dask methods. This gives simpler code, and also the ability to parallelize the computations.
Dask translates your dataframe/array operations into a graph of inter-related tasks with data dependencies between them.  Dask then executes this graph in parallel with multiple threads.
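As a small illustration of the execution step, the same lazy result can be computed with the threaded scheduler or with the single-threaded `"synchronous"` scheduler by passing the `scheduler` keyword to `.compute()`. The array below is a made-up example, not the notebook's dataset.

```python
import numpy as np
import dask.array as da

y = da.from_array(np.arange(10_000, dtype="float64"), chunks=(1_000,)).sum()

threaded = y.compute(scheduler="threads")     # multiple threads
serial = y.compute(scheduler="synchronous")   # one task at a time
print(threaded, serial)
```

The graph of tasks is the same in both cases; only the way it is executed changes.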


In the next notebook we will take a more in-depth look at how dask does this: task graphs.
