<img src="images/dask_horizontal.svg"
     width="45%"
     alt="Dask logo">
     
# Parallel Computing in Python with Dask

This notebook provides a high-level overview of Dask. We discuss why you might want to use Dask, high-level and low-level APIs for generating computational graphs, and Dask's schedulers which enable the parallel execution of these graphs.

# Overview

[Dask](https://docs.dask.org) is a flexible, [open source](https://github.com/dask/dask) library for parallel and distributed computing in Python. Dask is designed to scale the existing Python ecosystem.

You might want to use Dask because it:

- Enables parallel and larger-than-memory computations

- Uses familiar APIs you're used to from projects like NumPy, pandas, and scikit-learn

- Allows you to scale existing workflows with minimal code changes

- Works on your laptop, but also scales out to large clusters

- Offers great built-in diagnostic tools

### Components of Dask

At a high level, Dask is composed of two main components:

1. **Dask collections**, which extend common interfaces like NumPy, pandas, and Python iterators to larger-than-memory or distributed environments by creating *task graphs*
2. **Schedulers**, which compute the task graphs produced by Dask collections in parallel

<img src="images/dask-overview.png"
     width="85%"
     alt="Dask components">

### Task Graphs

In [None]:
def inc(i):
    return i + 1

def add(a, b):
    return a + b

a, b = 1, 12
c = inc(a)
d = inc(b)
output = add(c, d)

print(f'output = {output}')

This computation can be encoded in the following task graph:

![](images/inc-add.png)


- Graph of inter-related tasks with dependencies between them

- Circular nodes in the graph are Python function calls

- Square nodes are Python objects that are created by one task as output and can be used as inputs in another task
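
To make the idea concrete, the graph above can be written down as a plain dictionary and evaluated with a few lines of Python. This is only a toy sketch: `toy_graph` and `evaluate` are hypothetical names introduced here, and the recursive evaluator is not how Dask's schedulers work internally.

```python
def inc(i):
    return i + 1

def add(a, b):
    return a + b

# Each key names a node. A tuple means "call the first item on the
# results of the remaining keys"; anything else is a plain Python object.
toy_graph = {
    "a": 1,
    "b": 12,
    "c": (inc, "a"),
    "d": (inc, "b"),
    "output": (add, "c", "d"),
}

def evaluate(graph, key):
    """Compute `key`, resolving the tasks it depends on first (hypothetical helper)."""
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *arg_keys = node
        return func(*(evaluate(graph, k) for k in arg_keys))
    return node  # a plain object, i.e. a square node in the figure

toy_result = evaluate(toy_graph, "output")
print(f"output = {toy_result}")
```

Note that `c` and `d` do not depend on each other, which is exactly the kind of independence a scheduler can exploit to run tasks in parallel.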

# Dask Collections

Let's look at two Dask user interfaces: Dask Array and Dask Delayed.

## Dask Arrays

- Dask arrays are chunked, n-dimensional arrays

- You can think of a Dask array as a collection of NumPy `ndarray` chunks

- Dask arrays implement a large subset of the NumPy API using blocked algorithms

- For many purposes Dask arrays can serve as drop-in replacements for NumPy arrays

<img src="images/dask-array.png" width="50%">
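
As a rough picture of what a "blocked algorithm" means, the sketch below computes a sum in two steps using only NumPy: reduce each block on its own, then combine the partial results. The variable names are illustrative, and this is a simplified stand-in rather than Dask's actual implementation.

```python
import numpy as np

full = np.arange(12).reshape(4, 3)

# Split the rows into two blocks, much as a chunked array would be stored.
blocks = np.array_split(full, 2, axis=0)

# Step 1: reduce each block independently (these could run in parallel).
partial_sums = [block.sum() for block in blocks]

# Step 2: combine the per-block results into the final answer.
blocked_total = sum(partial_sums)

print(blocked_total == full.sum())
```

Because each block is reduced separately, only one block needs to be in memory at a time, which is what makes larger-than-memory computations possible.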

In [None]:
import numpy as np
import dask.array as da

In [None]:
x_np = np.random.random(size=(1_000, 1_000))
x_np

We can create a Dask array in a similar manner, but need to specify a `chunks` argument to tell Dask how to break up the underlying array into chunks.

In [None]:
x = da.random.random(size=(1_000, 1_000), chunks=(250, 500))

In [None]:
x    # Dask arrays have nice HTML output in Jupyter notebooks

Dask arrays look and feel like NumPy arrays. For example, they have `dtype` and `shape` attributes:

In [None]:
print(x.dtype)
print(x.shape)
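
In addition to the NumPy-style attributes, a Dask array also records how it is split up. The short sketch below rebuilds an array with the same shape and chunking as above (under the separate name `x_demo`, chosen here to avoid overwriting `x`) and inspects its chunk layout.

```python
import dask.array as da

x_demo = da.random.random(size=(1_000, 1_000), chunks=(250, 500))

print(x_demo.chunks)       # block sizes along each axis
print(x_demo.numblocks)    # number of blocks along each axis
print(x_demo.npartitions)  # total number of blocks
```

With `chunks=(250, 500)`, the 1,000 rows are split into four blocks and the 1,000 columns into two, for eight blocks in total.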

Dask arrays are _lazily_ evaluated. The result from a computation isn't computed until you ask for it. Instead, a Dask task graph for the computation is produced. You can visualize the task graph using the `visualize()` method.

In [None]:
x.visualize()

To compute a task graph, call the `compute()` method:

In [None]:
result = x.compute()    # We'll go into more detail about .compute() later on
result

The result of this computation is a familiar NumPy `ndarray`:

In [None]:
type(result)

Dask arrays support a large portion of the NumPy interface:

- Arithmetic and scalar mathematics: `+`, `*`, `exp`, `log`, ...

- Reductions along axes: `sum()`, `mean()`, `std()`, `sum(axis=0)`, ...

- Tensor contractions / dot products / matrix multiply: `tensordot`

- Axis reordering / transpose: `transpose`

- Slicing: `x[:100, 500:100:-2]`

- Fancy indexing along single axes with lists or numpy arrays: `x[:, [10, 1, 5]]`

- Array protocols like `__array__` and `__array_ufunc__`

- Some linear algebra: `svd`, `qr`, `solve`, `solve_triangular`, `lstsq`, ...

- ...

See the [Dask array API docs](http://docs.dask.org/en/latest/array-api.html) for full details about what portion of the NumPy API is implemented for Dask arrays.

We can build more complex computations using the familiar NumPy operations we're used to.

In [None]:
result = (x + x.T).sum(axis=0).mean()

In [None]:
result.visualize()

In [None]:
result.compute()

**Note**: Dask can be used to scale other array-like libraries that support the NumPy `ndarray` interface, for example [pydata/sparse](https://sparse.pydata.org/en/latest/) for sparse arrays or [CuPy](https://cupy.chainer.org/) for GPU-accelerated arrays.

## Dask Delayed

Sometimes problems don’t fit nicely into one of the high-level collections like Dask arrays or Dask DataFrames. In these cases, you can parallelize custom algorithms using the lower-level Dask `delayed` interface. This allows one to manually create task graphs with a light annotation of normal Python code.

In [None]:
import time
import random

def inc(x):
    time.sleep(random.random())
    return x + 1

def double(x):
    time.sleep(random.random())
    return 2 * x
    
def add(x, y):
    time.sleep(random.random())
    return x + y 

In [None]:
%%time

data = [1, 2, 3, 4]

output = []
for i in data:
    a = inc(i)
    b = double(i)
    c = add(a, b)
    output.append(c)

total = sum(output)

Dask `delayed` wraps function calls and delays their execution. `delayed` functions record what we want to compute (a function and input parameters) as a task in a graph that we’ll run later on parallel hardware by calling `compute`.

In [None]:
from dask import delayed

In [None]:
@delayed
def lazy_inc(x):
    time.sleep(random.random())
    return x + 1

In [None]:
lazy_inc

In [None]:
inc_output = lazy_inc(3)  # lazily evaluate inc(3)
inc_output

In [None]:
inc_output.compute()

Using `delayed` functions, we can build up a task graph for the particular computation we want to perform:

In [None]:
double_inc_output = lazy_inc(inc_output)
double_inc_output

In [None]:
double_inc_output.visualize()

In [None]:
double_inc_output.compute()

We can use `delayed` to make our previous example computation lazy by wrapping all the function calls with `delayed`:

In [None]:
import time
import random

@delayed
def inc(x):
    time.sleep(random.random())
    return x + 1

@delayed
def double(x):
    time.sleep(random.random())
    return 2 * x

@delayed
def add(x, y):
    time.sleep(random.random())
    return x + y

In [None]:
%%time

data = [1, 2, 3, 4]

output = []
for i in data:
    a = inc(i)
    b = double(i)
    c = add(a, b)
    output.append(c)

total = delayed(sum)(output)
total

In [None]:
total.visualize()

In [None]:
%%time

total.compute()
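
When you need more than one result from the same graph, one option is to pass several lazy objects to `dask.compute` in a single call, so that tasks they share are only run once. The sketch below assumes a hypothetical helper `square` and uses the arithmetic operators that `Delayed` objects support.

```python
import dask
from dask import delayed

@delayed
def square(x):  # hypothetical helper for illustration
    return x ** 2

shared = square(3)      # a task both results depend on
plus_one = shared + 1   # arithmetic on a Delayed object stays lazy
times_two = shared * 2

# Evaluate both results together; dask.compute returns a tuple.
r1, r2 = dask.compute(plus_one, times_two)
print(r1, r2)
```

Calling `plus_one.compute()` and `times_two.compute()` separately would give the same values, but each call would recompute `square(3)`.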

We highly recommend checking out the [Dask delayed best practices](http://docs.dask.org/en/latest/delayed-best-practices.html) page to avoid some common pitfalls when using `delayed`. 

# Schedulers

High-level collections like Dask arrays and Dask DataFrames, as well as the low-level `dask.delayed` interface, build up task graphs for a computation. After these graphs are generated, they need to be executed (potentially in parallel). This is the job of a task scheduler. Different task schedulers exist within Dask. Each will consume a task graph and compute the same result, but with different performance characteristics.

![grid-search](images/animation.gif "grid-search")


Dask has two different classes of schedulers: single-machine schedulers and a distributed scheduler.

## Single Machine Schedulers

Single machine schedulers provide basic features on a local process or thread pool and require no setup (they only use the Python standard library). The different single machine schedulers Dask provides are:

- `'threads'`: The threaded scheduler executes computations with a local `concurrent.futures.ThreadPoolExecutor`. The threaded scheduler is the default choice for Dask arrays, Dask DataFrames, and Dask delayed. 

- `'processes'`: The multiprocessing scheduler executes computations with a local `concurrent.futures.ProcessPoolExecutor`.

- `'single-threaded'`: The single-threaded synchronous scheduler executes all computations in the local thread, with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.
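
To give a feel for the kind of executor the `'threads'` option builds on, here is a standard-library-only sketch that runs independent, sleep-bound tasks on a `concurrent.futures.ThreadPoolExecutor`. It illustrates the executor alone, not Dask's scheduling logic, and `slow_inc` is a hypothetical helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_inc(x):
    time.sleep(0.1)  # sleeping releases the GIL, so threads can overlap
    return x + 1

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order in its results
    pooled = list(pool.map(slow_inc, [1, 2, 3, 4]))
elapsed = time.perf_counter() - start

print(pooled, f"{elapsed:.2f}s")
```

Threads pay off when tasks release the GIL (as sleeping, I/O, and much NumPy code do). For pure-Python, CPU-bound work, a process pool is usually the better fit.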

You can configure which scheduler is used in a few different ways. You can set the scheduler globally by using the `dask.config.set(scheduler=...)` command

In [None]:
import dask

dask.config.set(scheduler='threads')
x.compute(); # Will use the multi-threading scheduler

or use it as a context manager to set the scheduler for a block of code

In [None]:
with dask.config.set(scheduler='processes'):
    x.compute()  # Will use the multi-processing scheduler

or even within a single compute call

In [None]:
x.compute(scheduler='threads');  # Will use the multi-threading scheduler

The `num_workers` argument specifies the number of threads or processes to use:

In [None]:
x.compute(scheduler='threads', num_workers=4);
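
As a quick check that the choice of scheduler affects performance but not the answer, the sketch below evaluates one small graph with two of the single-machine schedulers. It uses `da.ones` so the expected value is known in advance; the names `ones`, `expr`, and `results` are illustrative.

```python
import dask.array as da

ones = da.ones((1_000, 1_000), chunks=(250, 500))
expr = (ones + ones.T).sum(axis=0).mean()

# Run the same task graph under different schedulers.
results = {
    name: float(expr.compute(scheduler=name))
    for name in ("threads", "single-threaded")
}
print(results)
```

Every entry of `ones + ones.T` is 2, so each column sums to 2,000 and the mean of those sums is 2,000 regardless of the scheduler.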

## Distributed Scheduler

Despite having "distributed" in its name, the distributed scheduler works well on both single and multiple machines. Think of it as the "advanced scheduler".

A Dask distributed cluster is composed of a single centralized scheduler and one or more worker processes. A `Client` object is used as the user-facing entry point to interact with the cluster. We will talk about the components of Dask clusters in more detail later on in [4-distributed-scheduler.py](4-distributed-scheduler.py).

<img src="images/dask-cluster.png"
     width="85%"
     alt="Dask components">

The distributed scheduler has many features:

- [Real-time, `concurrent.futures`-like interface](https://docs.dask.org/en/latest/futures.html)

- [Sophisticated memory management](https://distributed.dask.org/en/latest/memory.html)

- [Data locality](https://distributed.dask.org/en/latest/locality.html)

- [Adaptive deployments](https://distributed.dask.org/en/latest/adaptive.html)

- [Cluster resilience](https://distributed.dask.org/en/latest/resilience.html)

- ...

See the [Dask distributed documentation](https://distributed.dask.org) for full details about all the distributed scheduler features.

In [None]:
from dask.distributed import Client

# Creates a local Dask cluster
client = Client()
client

In [None]:
x = da.ones((20_000, 20_000), chunks=(400, 400))
result = (x + x.T).sum(axis=0).mean()

In [None]:
result.compute()

In [None]:
client.close()
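
The `concurrent.futures`-like interface listed among the features above can be sketched as follows. This example assumes the `distributed` package is installed and starts its own small in-process cluster with `processes=False`; `demo_client` and `inc` are names chosen for this illustration.

```python
from dask.distributed import Client

def inc(x):
    return x + 1

# processes=False keeps the scheduler and workers inside this process.
with Client(processes=False) as demo_client:
    future = demo_client.submit(inc, 10)      # returns a Future right away
    futures = demo_client.map(inc, range(4))  # one Future per input
    single = future.result()                  # block until the task finishes
    many = demo_client.gather(futures)        # collect a list of results

print(single, many)
```

Unlike collections, futures start running as soon as they are submitted, which suits workloads that evolve as results arrive.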

# Next steps

Next, let's learn more about performing custom operations on Dask collections in the [2-custom-operations.ipynb](2-custom-operations.ipynb) notebook.