# What is Dask

* [Dask](https://dask.org/) is a free and open-source parallel computing library that scales the existing Python ecosystem.
* Dask helps you scale your data science and machine learning workflows. 
* Dask makes it easy to work with NumPy, pandas, scikit-learn, and similar libraries.
* Dask is a framework to build distributed applications.
* Dask can scale down to your laptop and up to a cluster. Today we will use it in an environment you can set up on your own computer.


Dask can be split into **two components**:

* **Collections**: Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy arrays, Python lists, and pandas DataFrames. The advantage is that they can run in parallel on data that does not fit in memory.
* **Schedulers**: Dask provides schedulers that run the tasks in parallel.
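
To make the split between collections and schedulers concrete, here is a minimal sketch. It builds one lazy collection and evaluates it with two of Dask's single-machine schedulers. The array shape and chunk size are arbitrary choices for illustration, not values taken from this tutorial.

```python
import dask.array as da

# Collection: a lazy array made of 16 chunks of 250x250 ones (sizes chosen for illustration)
x = da.ones((1000, 1000), chunks=(250, 250))
total = x.sum()  # still lazy: this only builds a task graph

# Schedulers: the same graph can be executed in different ways
threaded = total.compute(scheduler="threads")          # thread pool
synchronous = total.compute(scheduler="synchronous")   # single thread, handy for debugging

print(threaded, synchronous)
```

Both calls should return the same value. Only the way the tasks are executed differs.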

## Examples
We will go over some concepts of Dask that we will need today.

### Dask Array

Dask arrays combine many [NumPy](https://numpy.org/) arrays, arranged into chunks within a grid.

Create a 100x100 array of random numbers represented by many NumPy arrays of size 10x10 (chunks at the edges will be smaller if the shape cannot be divided evenly by the chunk size).

In [1]:
import dask.array as da
x = da.random.random((100, 100), chunks=(10, 10))
x

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Count | 100 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |
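
The chunk layout can also be inspected programmatically. The sketch below uses the `chunks` and `numblocks` attributes of a Dask array. The `ragged` array is a hypothetical example added here to show what happens when the shape is not evenly divisible by the chunk size.

```python
import dask.array as da

x = da.random.random((100, 100), chunks=(10, 10))
print(x.numblocks)   # number of chunks along each axis
print(x.chunks[0])   # chunk sizes along the first axis

# Hypothetical example: 25 is not divisible by 10, so the last chunk is smaller
ragged = da.ones((25, 25), chunks=(10, 10))
print(ragged.chunks)  # ((10, 10, 5), (10, 10, 5))
```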


Use NumPy syntax for operations. Nothing is computed yet; Dask only records the tasks needed to produce the result.

In [2]:
y = x + x.T
z = y[::2, 50:].mean(axis=1)
z

| | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 40 B |
| Shape | (50,) | (5,) |
| Count | 430 Tasks | 10 Chunks |
| Type | float64 | numpy.ndarray |


To get the result as a NumPy array, call ``compute()``:

In [3]:
z.compute()

array([0.9955181 , 0.99849852, 0.82484775, 0.9289099 , 0.95525755,
       0.93982408, 1.00611189, 1.02421825, 1.0200419 , 0.98427177,
       0.98806315, 1.01754908, 1.03501315, 1.03424154, 0.9587375 ,
       0.93308796, 0.97090667, 1.08354844, 0.97808345, 1.1326681 ,
       0.92900706, 1.05189586, 0.91982236, 1.00712223, 0.93648723,
       0.9545439 , 1.08825738, 1.03141199, 0.98694143, 1.11813931,
       1.01205872, 1.07404602, 1.05633192, 0.96106003, 1.04912675,
       0.99491119, 0.91234236, 1.04380034, 1.17263988, 0.98348018,
       0.98822848, 0.94419134, 0.96293702, 1.07468751, 1.1075791 ,
       1.0076852 , 1.04794827, 0.98455103, 1.00013889, 1.00165424])
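
As a sanity check, the same expression can be evaluated with plain NumPy and compared with the Dask result. This is a minimal sketch that assumes the data is small enough to fit in memory. The fixed seed and the use of `da.from_array` are choices made for this illustration only.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)          # fixed seed, for reproducibility of the sketch
a = rng.random((100, 100))              # in-memory NumPy array
x = da.from_array(a, chunks=(10, 10))   # the same data, split into 10x10 chunks

lazy = (x + x.T)[::2, 50:].mean(axis=1)      # Dask: builds a task graph only
expected = (a + a.T)[::2, 50:].mean(axis=1)  # NumPy: computed immediately

result = lazy.compute()                      # runs the graph, returns a NumPy array
print(type(lazy).__name__, type(result).__name__)
print(np.allclose(result, expected))
```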

Depending on the available RAM, you can persist data in memory to speed up further computations:

In [None]:
y = y.persist()
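
One way to see what ``persist`` does is to compare the size of the task graph before and after the call. The sketch below is an illustration under stated assumptions: it uses `da.ones` instead of random data so the sum is predictable, and it counts tasks with `__dask_graph__()`, part of Dask's collection interface. The exact counts may vary between Dask versions.

```python
import dask.array as da

x = da.ones((100, 100), chunks=(10, 10))
lazy = x + x.T

# Before persisting, the graph holds every step needed to build each chunk
tasks_before = len(lazy.__dask_graph__())

# persist() computes the chunks now and keeps the results in memory
y = lazy.persist()
tasks_after = len(y.__dask_graph__())

# Later computations start from the in-memory chunks
total = y.sum().compute()
print(tasks_before, tasks_after, total)
```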

To find the time it takes to perform an operation, use the ``%time`` magic:

In [4]:
%time y.sum().compute()

CPU times: user 65.9 ms, sys: 17 ms, total: 82.9 ms
Wall time: 81.3 ms


10070.477327750283
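
``%time`` is an IPython magic, so it is only available in a notebook or an IPython shell. In a plain Python script, a rough equivalent can be sketched with ``time.perf_counter`` from the standard library. The timings will differ from machine to machine.

```python
from time import perf_counter

import dask.array as da

x = da.random.random((100, 100), chunks=(10, 10))
y = x + x.T

start = perf_counter()
total = y.sum().compute()        # the work happens here
elapsed = perf_counter() - start

print(f"sum = {total:.2f}, wall time = {elapsed * 1000:.1f} ms")
```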

### Dask Delayed

``for`` loops are a common target for parallelization, e.g. iterating over all the 2D planes of a Z-stack.

Below we show how to parallelize a loop that increments each value in a list, using ``dask.delayed``.

In [25]:
data = [1, 2, 3, 4, 5, 6, 7, 8]

In [26]:
from time import sleep

def increment(x):
    sleep(1)
    return x + 1

In [37]:
# Usual way of running
results = []
for x in data:
    y = increment(x)
    results.append(y)
    
total = sum(results)

print("Compute:", total) 

Compute: 44


We will "transform" our function to use ``dask.delayed``. 
The code below will finish **very quickly**. It will record what we want to compute as a task into a graph that will run later on parallel hardware.

In [32]:
from dask import delayed

In [33]:
# No computation happens
results = []

for x in data:
    y = delayed(increment)(x)
    results.append(y)
    
total = delayed(sum)(results)
print("Before computing:", total) 

Before computing: Delayed('sum-367bf53b-0caf-4aff-b76a-82bb052eba9f')


To get the result, we need to invoke the ``compute`` method.

In [34]:
result = total.compute()
print("After computing :", result)  # After it's computed

After computing : 44
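
``delayed`` can also be applied as a decorator, which avoids wrapping every call by hand. The sketch below is a variation on the example above, not a replacement for it. It shortens the sleep to 0.1 s and asks the threaded scheduler for 8 workers so that all eight calls can run at the same time. Both values are assumptions made for illustration.

```python
from time import perf_counter, sleep

from dask import delayed


@delayed
def increment(x):
    sleep(0.1)  # shorter than the 1 s used above, to keep the sketch quick
    return x + 1


data = [1, 2, 3, 4, 5, 6, 7, 8]

# Calling the decorated function only records a task; nothing runs yet
total = delayed(sum)([increment(x) for x in data])

start = perf_counter()
result = total.compute(scheduler="threads", num_workers=8)
elapsed = perf_counter() - start

print("Result:", result)
print(f"Elapsed: {elapsed:.2f} s")
```

Run sequentially, the eight calls would take about 0.8 s. With enough threads, the elapsed time should be closer to that of a single call.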


There are a few tricks to using ``dask.delayed`` well. Please check the [dask.delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html).

### Dask cluster 

Your computer probably has multiple cores, e.g. 4. When writing regular Python code, you are probably only using one of them. If you are using NumPy, e.g. for matrix multiplication, you may be using multiple cores because NumPy knows how to do it, but general Python code does not.

A Dask cluster allows you to use multiple cores on your computer.
Dask also has a dashboard that you can use to monitor your work.

This time we will use a local cluster to increment each value in the ``data`` list defined above.

In [41]:
def prepare_call(client):
    futures = []
    for x in data:
        y = client.submit(increment, x)
        futures.append(y)
    return futures

Create a local cluster

In [42]:
from dask.distributed import Client, LocalCluster

In [43]:
# if you want to specify number of workers etc.
cluster = LocalCluster(n_workers=2, processes=True, threads_per_worker=1)
# or simply
# cluster = LocalCluster()
with Client(cluster) as client:
    # perform code
    futures = prepare_call(client)
    results = client.gather(futures)

print(results)

[2, 3, 4, 5, 6, 7, 8, 9]
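
The loop in ``prepare_call`` can also be written with ``client.map``, which submits one task per element. The sketch below is a self-contained variation, with two choices that differ from the cell above. It uses ``processes=False`` so the workers run as threads inside the current process, which avoids process-spawning issues when the code is run as a plain script. It also uses the cluster as a context manager so that it is shut down at the end.

```python
from dask.distributed import Client, LocalCluster


def increment(x):
    return x + 1


data = [1, 2, 3, 4, 5, 6, 7, 8]

# processes=False is an assumption made for this sketch: workers are threads
with LocalCluster(n_workers=2, threads_per_worker=1, processes=False) as cluster:
    with Client(cluster) as client:
        futures = client.map(increment, data)  # one future per element
        results = client.gather(futures)       # block until all results are available

print(results)
```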


### License (BSD 2-Clause)
Copyright (C) 2022 University of Dundee. All Rights Reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.