<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Schedulers

In the previous two notebooks we used Dask.delayed and Dask.dataframe to create computations.  By default, these ran in a local thread pool on our personal machines.  Often, this is sufficient, especially when you are bound by NumPy and Pandas routines which release the GIL and when you are using powerful workstation computers with many cores.

However, sometimes you may want to execute your code in processes (for pure Python code that holds onto the GIL), in a single thread (for profiling and debugging), or across a cluster (for larger computations).

In this section we first talk about changing schedulers.  Then we use the `dask.distributed` scheduler in more depth.

### Local Schedulers

Dask separates computation description (task graphs) from execution (schedulers). This allows you to write code once, and run it locally or scale it out across a cluster.
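
To make this separation concrete, here is a minimal sketch in plain Python. It is not Dask's actual implementation. The graph is a dict in the spirit of Dask's task graphs (keys mapped to values, or to tuples of a function and the keys of its arguments), and `is_task`, `sync_get`, and `threaded_get` are hypothetical stand-ins written only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# A task graph written as a plain dict: each key maps either to a literal
# value or to a task tuple of the form (function, *keys_of_arguments).
graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "product": (mul, "a", "b"),
    "total": (add, "sum", "product"),
}


def is_task(value):
    """Return True for task tuples such as (add, "a", "b")."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def sync_get(graph, key, cache=None):
    """Hypothetical single-threaded scheduler: evaluate `key` recursively."""
    cache = {} if cache is None else cache
    if key not in cache:
        value = graph[key]
        if is_task(value):
            func, *arg_keys = value
            value = func(*(sync_get(graph, k, cache) for k in arg_keys))
        cache[key] = value
    return cache[key]


def threaded_get(graph, key, max_workers=4):
    """Hypothetical threaded scheduler: run all ready tasks in a pool, wave by wave."""
    done = {k: v for k, v in graph.items() if not is_task(v)}
    pending = {k: v for k, v in graph.items() if is_task(v)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while key not in done:
            # A task is ready once every argument it names has been computed.
            ready = [k for k, (_, *deps) in pending.items()
                     if all(d in done for d in deps)]
            if not ready:
                raise ValueError("graph has a cycle or a missing key")
            futures = {
                k: pool.submit(pending[k][0], *(done[d] for d in pending[k][1:]))
                for k in ready
            }
            for k, future in futures.items():
                done[k] = future.result()
                del pending[k]
    return done[key]


# The same graph runs unchanged under either scheduler.
print(sync_get(graph, "total"))      # 5
print(threaded_get(graph, "total"))  # 5
```

The graph never mentions threads or processes, so swapping the function that executes it changes how the work runs without touching the description of the work.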

Here we discuss the *local* schedulers - schedulers that run only on a single machine. The three options here are:

- `dask.threaded.get         # uses a local thread pool`
- `dask.multiprocessing.get  # uses a local process pool`
- `dask.get                  # uses only the main thread (useful for debugging)`

We can change the scheduler in a few different ways:

- By providing a `get=` keyword argument to `compute`:

```python
total.compute(get=dask.multiprocessing.get)
# or 
dask.compute(a, b, get=dask.multiprocessing.get)
```

- Using `dask.set_options`:

```python
# Use multiprocessing in this block
with dask.set_options(get=dask.multiprocessing.get):
    total.compute()
# Use multiprocessing globally
dask.set_options(get=dask.multiprocessing.get)
```

Here we repeat a simple dataframe computation from the previous section using the different schedulers:

In [None]:
import dask 
import dask.multiprocessing
import dask.dataframe as dd
import pandas as pd
from glob import glob
import os

In [None]:
df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': object,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

# Maximum non-cancelled delay
largest_delay = df[~df.Cancelled].DepDelay.max()

In [None]:
%time _ = largest_delay.compute()  # this uses threads by default

In [None]:
%time _ = largest_delay.compute(get=dask.multiprocessing.get)  # this uses processes

In [None]:
%time _ = largest_delay.compute(get=dask.get)  # This uses a single thread

By default the threaded and multiprocessing schedulers use the same number of workers as cores. You can change this using the `num_workers` keyword in the same way that you specified `get` above:

```python
largest_delay.compute(get=dask.multiprocessing.get, num_workers=2)
```
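
The effect of a worker limit can be seen without Dask at all. The sketch below uses the standard library's `ThreadPoolExecutor` as a stand-in for a scheduler's pool. `peak_concurrency` is a hypothetical helper, and `max_workers` plays the role that `num_workers` plays above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_workers, num_tasks=8, task_seconds=0.05):
    """Run `num_tasks` sleeping tasks and report how many overlapped at most."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def task():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(task_seconds)  # stand-in for real work
        with lock:
            running -= 1

    # Leaving the `with` block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_tasks):
            pool.submit(task)
    return peak


print(peak_concurrency(num_workers=2))  # never more than 2
print(peak_concurrency(num_workers=4))  # never more than 4
```

However many tasks are queued, the number running at once never exceeds the pool size.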

To see how many cores you have on your computer, you can use `multiprocessing.cpu_count`:

In [None]:
from multiprocessing import cpu_count
cpu_count()

### Some Questions to Consider:

- How much speedup is possible for this task? (Hint: look at the graph.)
- Given how many cores are on this machine, how much faster could the parallel schedulers be than the single-threaded scheduler?
- How much faster was using threads over a single thread? Why does this differ from the optimal speedup?
- Why is the multiprocessing scheduler so much slower here?

---

## In what cases would you want to use one scheduler over another?

http://dask.pydata.org/en/latest/scheduler-choice.html

---

## Profiling

*You should skip this section if you are running low on time*.

The synchronous scheduler is particularly valuable for debugging and profiling.  

For example, the IPython `%prun` magic gives us profiling information about which functions take up the most time in our computation.  Try this magic on the computation above with different schedulers.  How informative is this magic when running parallel code?

In [None]:
%prun _ = largest_delay.compute(get=dask.threaded.get)

In [None]:
%prun _ = largest_delay.compute(get=dask.get)
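
Outside a notebook, a similar report can be produced with the standard library's `cProfile` module. The following is a minimal sketch, not something this tutorial depends on: `parse_rows` is a hypothetical stand-in for a slow pure-Python task, and `profile_report` is a hypothetical helper that profiles one call made in the current thread, which is the situation the synchronous scheduler puts you in.

```python
import cProfile
import io
import pstats


def parse_rows(n):
    """Stand-in for a slow pure-Python task."""
    return sum(int(str(i)) for i in range(n))


def profile_report(func, *args, top=5):
    """Profile one call in the current thread and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    # Write the report into a string buffer instead of printing it directly.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


print(profile_report(parse_rows, 50_000))
```

Sorting by cumulative time puts the functions that account for most of the run at the top of the report.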

To aid in profiling parallel execution, dask provides several [`diagnostics`](http://dask.pydata.org/en/latest/diagnostics.html) for measuring and visualizing performance. These are useful for seeing bottlenecks in the *parallel* computation, whereas the above `prun` is useful for seeing bottlenecks in individual *tasks*.

In [None]:
from dask.diagnostics import Profiler, ResourceProfiler, visualize
from bokeh.io import output_notebook
output_notebook()

with Profiler() as p, ResourceProfiler(0.25) as r:
    largest_delay.compute()
    
visualize([r, p]);

From the plot above, we can see that while tasks are running concurrently, due to GIL effects we're only achieving parallelism during early parts of `pd.read_csv` (mostly the byte operations).


*Note that the `dask.diagnostics` module is only useful when profiling on a single machine. The `dask.distributed` scheduler has its own set of diagnostics.*

---

## Distributed Scheduler

The `dask.distributed` system is composed of a single centralized scheduler and several worker processes.  We can either set these up manually as command line processes or have Dask set them up for us from the notebook.  


#### Automatically set up a local cluster

Starting a single scheduler and worker on the local machine is a common case. Dask will set up a local cluster for you if you provide no scheduler address to `Client`:

```python
from dask.distributed import Client
client = Client()
```

If you choose this approach then there is no need to set up a `dask-scheduler` or `dask-worker` process as described below.


#### Manually set up a distributed cluster

It is good to know how to set things up manually in case you want to try out Dask on a small cluster of your own.  However, if you are unfamiliar with the command line you can safely skip this section and use the automatic setup described above.

We run the `dask-scheduler` process on one machine:

```
$ dask-scheduler
distributed.scheduler - INFO -   Scheduler at:  tcp://127.0.0.1:8786
distributed.bokeh.application - INFO - Web UI: http://127.0.0.1:8787/status/
```

The scheduler reports that it is running at 127.0.0.1 on port 8786.  The address 127.0.0.1 is another name for "localhost" or "this machine".  We will need to give this address to the workers and client so you might want to copy it now.

We now run the `dask-worker` process on every machine that we want to use for computation.  For now this is probably just once on our laptop, but in production this may be on many different machines:

```
$ dask-worker 127.0.0.1:8786
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45011
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:8786
```

*Note*: for this tutorial we want to start the dask-worker process in the `dask-tutorial-pydata-seattle/` directory.  This ensures that the workers' Python processes have access to the same data that our notebook process does.

```
$ cd dask-tutorial-pydata-seattle/  # navigate to wherever you have started your notebooks
~/dask-tutorial-pydata-seattle/ $ dask-worker 127.0.0.1:8786
```

*Note*: By default the dask-worker command line tool starts a single process with a thread pool with as many threads as you have cores on your computer.  If you are doing mostly GIL-released computations (numpy, pandas, scikit-learn) then this is the right choice.  However, if you are doing mostly GIL-bound computations (Python code, pandas with text, parsing) then you will want to start the workers with multiple processes and one thread per process:

```
$ dask-worker 127.0.0.1:8786 --nprocs 8 --nthreads 1
```
    
You can see more options by asking for help:

```
$ dask-worker --help
```
    
When the scheduler and workers are running you can connect to them using a Dask `Client`, giving it the same scheduler address that you gave to the workers:

```python
from dask.distributed import Client
client = Client('127.0.0.1:8786')
```
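
If `Client` fails to connect, a quick first check is whether anything is listening at the scheduler address. The sketch below uses only the standard library. `scheduler_reachable` is a hypothetical helper, not part of Dask; it expects an address of the form `host:port` and only tests that the TCP port accepts connections, not that a Dask scheduler is behind it.

```python
import socket


def scheduler_reachable(address, timeout=1.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = address.rpartition(":")
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Prints True only if something is listening at the default scheduler address.
print(scheduler_reachable("127.0.0.1:8786"))
```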

### More information

You can find more information at the following documentation pages:

- [Quickstart](http://distributed.readthedocs.io/en/latest/quickstart.html)
- [Setup Network](http://distributed.readthedocs.io/en/latest/setup.html)

## Using a local cluster

The multiprocessing scheduler can be inefficient for complicated workflows. The distributed scheduler doesn't have this downside and works well locally. This often makes it a good replacement for the multiprocessing scheduler, even when working on a single machine.

Here we start up a local cluster and use it to repeat the same dataframe computation as above:

In [None]:
from dask.distributed import Client

# Set up a local cluster.
# By default this sets up 1 worker per core
client = Client()
client

In [None]:
%time _ = largest_delay.compute(get=client.get)

#### Some Questions to Consider

- How does this compare to the optimal parallel speedup?
- Why is this faster than the threaded scheduler?

### Client takes over by default

Actually, we didn't need to add `get=client.get`.  The distributed scheduler takes over as the default scheduler for all collections when the Client is created:

In [None]:
%time _ = largest_delay.compute()  # This used to use threads by default, now it uses dask.distributed

---

## Diagnostics

One of the main advantages of using the distributed scheduler is the diagnostics dashboard, which should be hosted live at http://localhost:8787/status.  Visit that link and then run the computation again.

You may want to arrange your notebook and this webpage side-by-side on your screen so that you can see both at the same time.

In [None]:
%time _ = largest_delay.compute()

### Exercise

Repeat the groupby computation from the previous notebook, and watch the diagnostics page. What is taking all of the time?

In [None]:
# What was the average departure delay from each airport?
df[~df.Cancelled].groupby('Origin').DepDelay.mean().compute()

---

### New API

The distributed scheduler is more sophisticated than the single-machine schedulers.  It comes with more functions to manage data, compute in the background, and more.  The distributed scheduler also has entirely separate documentation:

-  http://distributed.readthedocs.io/en/latest/
-  http://distributed.readthedocs.io/en/latest/api.html