# Dask Schedulers

In this notebook we demonstrate how to choose and configure Dask schedulers, using a Dask DataFrame computation as the running example.

* A few words about Dask schedulers
* Dask schedulers on a single machine
    * local threads
    * local processes
    * single thread
* Apply scheduler options to weather station data
* Distributed schedulers (local)

-----

- Authors: NCI Virtual Research Environment Team
- Keywords: Dask, Dataframe, schedulers
- Create Date: 2020-May
- Lineage/Reference: This tutorial is based on the [dask tutorial](https://github.com/dask/dask-tutorial).

----

### Prerequisite

The following modules are needed:

* Pandas
* Dask
* Pyarrow

## Schedulers

In the previous notebooks, we used `dask.delayed` and `dask.dataframe` to parallelize computations.
These work by building a *task graph* instead of executing immediately.
Each *task* represents some function to call on some data, and the full *graph* is the relationship between all the tasks.

When we wanted the actual result, we called `compute`, which handed the task graph off to a *scheduler*.

**Schedulers are responsible for running a task graph and producing a result**.
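As a small illustration of the graph-then-scheduler flow described above, the sketch below builds a three-task graph with `dask.delayed` and only runs it when `compute` is called. The functions `inc` and `add` are hypothetical helpers, not part of the weather-station example.

```python
import dask


@dask.delayed
def inc(x):
    # One task: a function applied to some data
    return x + 1


@dask.delayed
def add(x, y):
    # A task that depends on the results of two other tasks
    return x + y


# Building the graph is lazy: nothing has executed yet
total = add(inc(1), inc(2))

# compute() hands the task graph to a scheduler, which runs it
result = total.compute(scheduler="synchronous")
print(result)
```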

![](https://raw.githubusercontent.com/dask/dask-org/master/images/grid_search_schedule.gif)

First, there are the single machine schedulers that execute things in parallel using threads or processes (or synchronously for debugging). These are what we've used up until now. Second, there's the `dask.distributed` scheduler, which is newer and has more features than the single machine scheduler.

In this notebook we'll first talk about the different schedulers. Then we'll use the `dask.distributed` scheduler in more depth.

### Local Schedulers

Dask separates computation description (task graphs) from execution (schedulers). This allows you to write code once, and run it locally or scale it out across a cluster.

Dask has two families of task schedulers:

- Single machine scheduler: This scheduler provides basic features on a local process or thread pool. This scheduler was made first and is the default. It is simple and cheap to use, although it can only be used on a single machine and does not scale.

- Distributed scheduler: This scheduler is more sophisticated, offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.

For different computations you may find better performance with particular scheduler settings. This session helps you understand how to choose between and configure different schedulers, and provides guidelines on when one might be more appropriate.

#### Local Threads

```python
import dask
dask.config.set(scheduler='threads')  # overwrite default with threaded scheduler
```

The threaded scheduler executes computations with a local multiprocessing.pool.ThreadPool. It is lightweight and requires no setup. It introduces very little task overhead (around 50us per task) and, because everything occurs in the same process, it incurs no costs to transfer data between tasks. However, due to Python’s Global Interpreter Lock (GIL), this scheduler only provides parallelism when your computation is dominated by non-Python code, as is primarily the case when operating on numeric data in NumPy arrays, Pandas DataFrames, or using any of the other C/C++/Cython based projects in the ecosystem.

The threaded scheduler is the default choice for Dask Array, Dask DataFrame, and Dask Delayed. However, if your computation is dominated by processing pure Python objects like strings, dicts, or lists, then you may want to try one of the process-based schedulers below (we currently recommend the distributed scheduler on a local machine).
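As a rough illustration of the kind of workload the threaded scheduler suits, the sketch below sums a few NumPy arrays under `scheduler="threads"` and checks the answer against the synchronous scheduler. The function name `column_sum` and the array sizes are made up for illustration, and no particular speedup is claimed.

```python
import numpy as np
import dask


@dask.delayed
def column_sum(seed):
    # Numeric work of the NumPy kind that the text describes as thread-friendly
    rng = np.random.default_rng(seed)
    return rng.random(200_000).sum()


parts = [column_sum(seed) for seed in range(4)]
total = dask.delayed(sum)(parts)

# Run the same graph with the threaded scheduler...
with dask.config.set(scheduler="threads"):
    threaded = total.compute()

# ...and with the single-threaded scheduler for comparison
serial = total.compute(scheduler="synchronous")
print(threaded, serial)
```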

#### Local Processes

```python
import dask.multiprocessing
dask.config.set(scheduler='processes')  # overwrite default with multiprocessing scheduler
```

The multiprocessing scheduler executes computations with a local multiprocessing.Pool. It is lightweight to use and requires no setup. Every task and all of its dependencies are shipped to a local process, executed, and then their result is shipped back to the main process. This means that it is able to bypass issues with the GIL and provide parallelism even on computations that are dominated by pure Python code, such as those that process strings, dicts, and lists.

However, moving data to remote processes and back can introduce performance penalties, particularly when the data being transferred between processes is large. The multiprocessing scheduler is an excellent choice when workflows are relatively linear, and so do not involve significant inter-task data transfer, and when inputs and outputs are both small, like filenames and counts.

#### Single Thread

```python
import dask
dask.config.set(scheduler='synchronous')  # overwrite default with single-threaded scheduler
```

The single-threaded synchronous scheduler executes all computations in the local thread with no parallelism at all. This is particularly valuable for debugging and profiling, which are more difficult when using threads or processes.

For example, when using IPython or Jupyter notebooks, the `%debug`, `%pdb`, or `%prun` magics will not work well when using the parallel Dask schedulers (they were not designed to be used in a parallel computing context). However, if you run into an exception and want to step into the debugger, you may wish to rerun your computation under the single-threaded scheduler where these tools will function properly.
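A minimal sketch of that debugging workflow, assuming a hypothetical `parse_rainfall` task that fails on bad input: under the synchronous scheduler the exception is raised in the calling thread, so it can be caught (or inspected with `%debug`) like any ordinary Python error.

```python
import dask


@dask.delayed
def parse_rainfall(text):
    # Raises ValueError for anything that is not a number
    return float(text)


readings = [parse_rainfall(t) for t in ["1.2", "0.0", "n/a"]]
total = dask.delayed(sum)(readings)

error_message = None
try:
    # Everything runs in the local thread, so the failure surfaces here
    total.compute(scheduler="synchronous")
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```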

The schedulers discussed here are *local* schedulers, which run only on a single machine. The scheduler can be selected in a few different ways:

- By providing a `scheduler=` keyword argument to `compute`:

```python
max_rain.compute(scheduler='processes')
# or 
max_rain.compute(scheduler='synchronous')
```

- Using `dask.config.set` (which replaces the older `dask.set_options`):

```python
# Use multiprocessing in this block
with dask.config.set(scheduler='processes'):
    max_rain.compute()
# Use the single-threaded scheduler globally
dask.config.set(scheduler='synchronous')
```
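The two approaches can be compared side by side in a small self-contained sketch. It uses a hypothetical `double` task in place of `max_rain`, and `dask.config.get` to confirm which scheduler is active inside the `with` block.

```python
import dask


@dask.delayed
def double(x):
    return 2 * x


value = double(21)

# Scoped selection: the setting applies only inside the block
with dask.config.set(scheduler="synchronous"):
    active = dask.config.get("scheduler")
    scoped = value.compute()

# Per-call selection: the keyword applies to this compute only
per_call = value.compute(scheduler="threads")

print(active, scoped, per_call)
```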

Here we repeat a simple dataframe computation from the previous section using the different schedulers:

In [1]:
import os
import dask.dataframe as dd

In [3]:
filename = os.path.join('data', 'weather_stations_ACT','IDCJAC0009_*_*','IDCJAC0009*.csv')
df = dd.read_csv(filename)
# rename column headers
df.columns = ['code','station','year','month','day','rainfall','period','quality']
# Maximum rainfall
max_rain=df.rainfall.max()

In [4]:
max_rain

dd.Scalar<series-..., dtype=float64>

In [5]:
%time _ = max_rain.compute()  # this uses threads by default

CPU times: user 3.96 s, sys: 853 ms, total: 4.81 s
Wall time: 2.63 s


In [9]:
import dask.multiprocessing
%time _ = max_rain.compute(scheduler='processes')  # this uses processes

CPU times: user 427 ms, sys: 304 ms, total: 731 ms
Wall time: 800 ms


In [10]:
%time _ = max_rain.compute(scheduler='synchronous')  # This uses a single thread

CPU times: user 1.22 s, sys: 177 ms, total: 1.4 s
Wall time: 1.45 s


By default the threaded and multiprocessing schedulers use the same number of workers as cores. You can change this using the `num_workers` keyword in the same way that you specified `scheduler` above:

```python
max_rain.compute(scheduler='processes', num_workers=2)
```
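A self-contained sketch of the same keyword, using the threaded scheduler and a hypothetical `square` task so that it runs without the weather-station data:

```python
import dask


@dask.delayed
def square(x):
    return x * x


total = dask.delayed(sum)([square(i) for i in range(10)])

# Limit the thread pool to two workers for this call only
result = total.compute(scheduler="threads", num_workers=2)
print(result)
```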

To see how many cores you have on your computer, you can use `multiprocessing.cpu_count`:

In [11]:
from multiprocessing import cpu_count
cpu_count()

96

### Some Questions to Consider:

- How much speedup is possible for this task? (Hint: look at the graph.)
- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?
- How did the threaded scheduler compare with the single-threaded scheduler? Why does this differ from the optimal speedup?
- How did the multiprocessing scheduler compare with the other two here, and why?

---

## In what cases would you want to use one scheduler over another?

http://dask.pydata.org/en/latest/setup/single-machine.html

---

## Distributed Scheduler

The `dask.distributed` system is composed of a single centralized scheduler and many worker processes. [Deploying](http://dask.pydata.org/en/latest/setup.html) a remote Dask cluster involves some additional effort, but running locally just involves creating a `Client` object, which lets you interact with the "cluster" (local threads or processes on your machine). For more information see [here](http://dask.pydata.org/en/latest/setup/single-distributed.html).

In [13]:
from dask.distributed import Client

# Setup a local cluster.
# By default this sets up 1 worker per core
client = Client()
client.cluster

Be sure to click the `Dashboard` link to open up the diagnostics dashboard. If you run this notebook under the Pangeo environment on Gadi, the server port number is on the second line of the `client.cmd` file.

By default, creating a `Client` makes it the default scheduler. Any calls to `.compute` will use the cluster your `client` is attached to (See http://dask.pydata.org/en/latest/scheduling.html for how to specify which scheduler to use).
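The following sketch illustrates that behaviour under a few assumptions made for portability: the cluster runs in-process (`processes=False`), the dashboard is disabled (`dashboard_address=None`), and a hypothetical `inc` task stands in for `max_rain`. Inside the `with` block, a plain `.compute()` call is executed by the client's cluster.

```python
import dask
from dask.distributed import Client


@dask.delayed
def inc(x):
    return x + 1


total = dask.delayed(sum)([inc(i) for i in range(5)])

# While the client is open it is the default scheduler,
# so compute() needs no scheduler= argument
with Client(processes=False, dashboard_address=None) as client:
    result = total.compute()

print(result)
```

Using the client as a context manager also shuts the local cluster down when the block exits.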

In [14]:
%time max_rain.compute()

CPU times: user 774 ms, sys: 55.4 ms, total: 829 ms
Wall time: 2.65 s


322.1

#### Some Questions to Consider

- How does this compare to the optimal parallel speedup?
- How does this compare with the threaded scheduler, and why?

---

### Exercise

Run the following computations while looking at the diagnostics page. In each case what is taking the most time?

In [None]:
### 1.) How many rows are in our dataset?
_ = len(df)

In [None]:
### 2.) In total, how many days of records were taken?
import pandas as pd
_ = pd.read_parquet('data/ACT_weather.parquet', columns=['period'], engine='pyarrow')

In [None]:
### 3.) In total, how many non-record days are there for each weather station?
_ = df.groupby("station").period.count().compute()

In [None]:
### 4.) What was the average rainfall from each station?
_ = df.groupby("station").rainfall.mean().compute()

---

### More...

The distributed scheduler is more sophisticated than the single machine schedulers. It can compute asynchronously, and also provides an API similar to that of `concurrent.futures`. For further information, see the docs at http://distributed.readthedocs.io/en/latest/.
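A brief sketch of that futures-style interface, assuming an in-process client with the dashboard disabled and a hypothetical `inc` function: `client.submit` and `client.map` return futures immediately, and `result` or `client.gather` block until the values are ready.

```python
from dask.distributed import Client


def inc(x):
    return x + 1


with Client(processes=False, dashboard_address=None) as client:
    # submit() schedules one call and returns a future right away
    future = client.submit(inc, 10)
    single = future.result()

    # map() schedules one call per input; gather() collects the results
    futures = client.map(inc, range(4))
    many = client.gather(futures)

print(single, many)
```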

## Reference

https://docs.dask.org/en/latest/scheduling.html