## Introduction to Dask

[Dask](https://dask.org/) is a library for scalable computing in Python. It primarily does two things: 

1. It schedules tasks on a cluster. This helps parallelize computationally intensive work.

2. It implements collections for large datasets. These make large datasets manageable, so that scheduling tasks on them becomes easier.

Let's start with scheduling. We'll begin with a trivial computational task. The following cell implements two simple operations: an `inc` function that increments a number by 1, and an `add` function that adds two numbers together. To demonstrate the effects of parallelization, each function also sleeps for one second, to simulate a more time-consuming computation.

In [None]:
import time

def inc(x):
    time.sleep(1)
    return x + 1

def add(x, y):
    time.sleep(1)
    return x + y

How much time would it take to increment two numbers and add them together?

In [None]:
%%time
x1 = inc(1)
x2 = inc(2)
z = add(x1, x2)

## Introducing the `delayed` decorator

First of all, what is a decorator? One way to think about it (but it's not the [full](https://matthew-brett.github.io/pydagogue/decorating_for_dummies.html) [story](https://matthew-brett.github.io/pydagogue/decorating_for_smart_people.html)) is that it's a function that takes a function as input and produces a function as output. 

This means that we can process a function that we have written, so that it does something slightly different than it was originally intended to do. 
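
As a minimal sketch of that idea in plain Python (no Dask involved), here is a decorator that prints a message before calling the function it wraps. The names `announce` and `square` are hypothetical and chosen only for illustration.

```python
import functools

def announce(func):
    """Hypothetical decorator: print the function's name, then call it."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@announce
def square(x):
    return x * x

# The @ syntax above is shorthand for: square = announce(square)
result = square(3)
```

The decorated `square` still returns the same value as before; the decorator only adds behavior around the call.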

In the case of the Dask `delayed` decorator, calling the wrapped function does not run it right away. Instead, the call is recorded in a task graph, and execution is deferred until we ask for the result.

Let's see what that means in practice. Let's create delayed versions of our `add` and `inc` functions:


In [None]:
from dask import delayed

delayed_inc = delayed(inc)
delayed_add = delayed(add)

What are these things? 

In [None]:
print(type(delayed_inc))
print(type(delayed_add))

That's odd. Are these things like a function?

In [None]:
print(callable(delayed_inc))
print(callable(delayed_add))

Looks like they are. What happens when we call them?

In [None]:
%%time
x1 = delayed_inc(1)
x2 = delayed_inc(2)
z = delayed_add(x1, x2)

Whoa. That was fast! Does `z` take the expected value?

In [None]:
z

Hmm. Doesn't look like it. In fact, no computation has occurred so far. Dask has only built a task graph, which it will execute when we call the object's `compute` method. We can inspect that graph:

In [None]:
z.visualize()
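
To make the idea of recording calls without running them more concrete, here is a toy sketch in plain Python. The `Lazy` class and the `inc_logged` and `add_logged` helpers are hypothetical stand-ins; this is not how Dask is implemented, and unlike Dask it runs everything sequentially.

```python
class Lazy:
    """Toy stand-in for a deferred call (hypothetical, not Dask's implementation)."""

    def __init__(self, func, *args):
        # Only remember what to call and with which arguments.
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy arguments first, then call the function itself.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)

calls = []  # log of which functions have actually run

def inc_logged(x):
    calls.append("inc")
    return x + 1

def add_logged(x, y):
    calls.append("add")
    return x + y

a = Lazy(inc_logged, 1)
b = Lazy(inc_logged, 2)
c = Lazy(add_logged, a, b)

calls_before = list(calls)  # still empty: nothing has run yet
value = c.compute()         # now the recorded calls execute
```

Because `c` knows that it depends on `a` and `b`, and that those two do not depend on each other, a real scheduler could run them at the same time. That dependency structure is the information a task graph captures.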

It's only when we call the `compute` method that the work is done.

In [None]:
%%time
z.compute()

The result is returned, and the computation now takes 2 seconds instead of 3!

Things to think about: 
- Why did we go from 3s to 2s? Why weren't we able to parallelize down to 1s?
- What would have happened if the `inc` and `add` functions didn't include the `time.sleep(1)` call? Would Dask still be able to speed up this code?
- What if we have multiple outputs, or also want to get access to `x1` or `x2`?
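
One possible way to approach the last question, as a sketch: `dask.compute` accepts several lazy objects and returns their results together as a tuple. The example also uses `delayed` with the `@` decorator syntax instead of calling it as a function. The names `inc_fast` and `add_fast` are hypothetical variants of the functions above, written without the sleep so that the example runs quickly.

```python
import dask
from dask import delayed

@delayed
def inc_fast(x):
    # Same as inc, minus the sleep (hypothetical helper for this sketch)
    return x + 1

@delayed
def add_fast(x, y):
    # Same as add, minus the sleep (hypothetical helper for this sketch)
    return x + y

a = inc_fast(1)
b = inc_fast(2)
c = add_fast(a, b)

# Evaluate all three lazy objects in one call; the results come back as a tuple.
a_val, b_val, c_val = dask.compute(a, b, c)
```

Asking for all three values in a single call lets Dask work from one combined graph, rather than calling `compute` on each object separately.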

## Exercise: Parallelize a for loop

`for` loops are one of the most common things that we want to parallelize.  Use `dask.delayed` on `inc` and `sum` to parallelize the computation below:

In [None]:
data = [1, 2, 3, 4, 5, 6, 7, 8]

In [None]:
%%time
# Sequential code

results = []
for x in data:
    y = inc(x)
    results.append(y)
    
total = sum(results)

In [None]:
total

In [None]:
%%time
# Your parallel code here...

In [None]:
%load solutions/01-delayed-loop.py

## Exercise: Parallelize a for loop with control flow

Often we want to delay only *some* functions, running a few of them immediately.  This is especially helpful when those functions are fast and help us to determine what other slower functions we should call.  This decision, to delay or not to delay, is usually where we need to be thoughtful when using `dask.delayed`.

In the example below we iterate through a list of inputs. If the input is even, we want to call `double`. If the input is odd, we want to call `inc`. This `is_even` decision to call `double` or `inc` has to be made immediately (not lazily) in order for our graph-building Python code to proceed.

In [None]:
def double(x):
    time.sleep(1)
    return 2 * x

def is_even(x):
    return not x % 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
%%time
# Sequential code

results = []
for x in data:
    if is_even(x):
        y = double(x)
    else:
        y = inc(x)
    results.append(y)
    
total = sum(results)
print(total)

In [None]:
%%time
# Your parallel code here

In [None]:
%load solutions/01-delayed-control-flow.py