<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Schedulers

In the previous notebook we used Dask.delayed to create computations.  By default, these ran on a local thread pool.  Often, this is sufficient, especially when you are bound by NumPy and Pandas routines which release the GIL and when you are using powerful workstation computers with many cores.

However, sometimes you may want to execute your code in processes (for pure Python code that holds onto the GIL), in a single thread (for profiling and debugging), or across a cluster (for larger computations).

In this section we first talk about changing schedulers.  Then we use the dask.distributed scheduler in more depth.

### Previous exercise

One solution to the previous exercise follows below.  We run this same computation using three single-machine schedulers:

1.  `dask.threaded.get         # uses a local threadpool`
2.  `dask.multiprocessing.get  # uses a local process pool`
3.  `dask.async.get_sync       # uses only the main thread (useful for debugging)`

In each case we change the scheduler by providing a `get=` keyword argument like the following:

```python
total.compute(get=dask.multiprocessing.get)
# or 
dask.compute(a, b, get=dask.multiprocessing.get)
```
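
The `get=` keyword above matches the Dask release this notebook was written for.  To our knowledge, newer releases select the scheduler by name with a `scheduler=` keyword instead.  The following is a minimal sketch under that assumption; `inc` and `add` are hypothetical stand-ins for real work and are not part of the exercise.

```python
import dask


@dask.delayed
def inc(x):
    # Hypothetical stand-in for real work.
    return x + 1


@dask.delayed
def add(x, y):
    # Hypothetical stand-in for real work.
    return x + y


total = add(inc(1), inc(2))

# Assumption: a newer Dask release, where schedulers are chosen by name.
threaded = total.compute(scheduler="threads")         # local thread pool
synchronous = total.compute(scheduler="synchronous")  # main thread only
# scheduler="processes" would select the local process pool instead.

print(threaded, synchronous)
```

Both calls compute the same graph, so the results are identical; only where the tasks run changes.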

In [None]:
import dask 
import dask.multiprocessing
import pandas as pd
from glob import glob
import os

In [None]:
%%time

filenames = sorted(glob(os.path.join('data', 'stocks', 'GOOG', '*.csv')))

spreads = []
days = []
for fn in filenames:
    df = dask.delayed(pd.read_csv)(fn, parse_dates=['timestamp'], index_col='timestamp')
    spread = df.high.max() - df.low.min()
    day = df.index[0].date()
    
    spreads.append(spread)
    days.append(day)

In [None]:
%time s, d = dask.compute(spreads, days)  # this uses threads by default

In [None]:
%time s, d = dask.compute(spreads, days, get=dask.multiprocessing.get)  # this uses processes

In [None]:
%time s, d = dask.compute(spreads, days, get=dask.async.get_sync)  # this uses a single thread

### Profiling

The synchronous scheduler is particularly valuable for debugging and profiling.  

For example, the IPython `%prun` magic gives us profiling information about which functions take up the most time in our computation.  Try this magic on the computation above with each of the schedulers.  How informative is this magic when running parallel code?

In [None]:
%prun s, d = dask.compute(spreads, days, get=dask.threaded.get)

In [None]:
%prun s, d = dask.compute(spreads, days, get=dask.multiprocessing.get)

In [None]:
%prun s, d = dask.compute(spreads, days, get=dask.async.get_sync)
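
Outside of IPython, a similar profile can be collected with the standard library profiler.  The sketch below runs a small graph on the synchronous scheduler under `cProfile` and checks that the task function shows up in the recorded statistics.  It assumes a newer Dask release that accepts `scheduler="synchronous"`; `slow_sum` is a hypothetical task, not part of the exercise.

```python
import cProfile
import pstats

import dask


def slow_sum(n):
    # Hypothetical pure-Python task that should dominate the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total


tasks = [dask.delayed(slow_sum)(200_000) for _ in range(4)]

profiler = cProfile.Profile()
profiler.enable()
# Assumption: a newer Dask release that accepts scheduler="synchronous".
(results,) = dask.compute(tasks, scheduler="synchronous")
profiler.disable()

stats = pstats.Stats(profiler)
# Keys of stats.stats are (filename, line number, function name) tuples.
recorded = {name for (_file, _line, name) in stats.stats}
print("slow_sum recorded:", "slow_sum" in recorded)

# Show the five entries with the largest cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

Because the synchronous scheduler runs every task in the calling thread, the profiler sees the task function directly.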

### Context managers

Sometimes your `.compute` method is buried deeply within code and you can't easily provide a `get=` keyword.  In these cases you can also use the `dask.set_options` context manager.

In [None]:
%%time

with dask.set_options(get=dask.async.get_sync):
    s, d = dask.compute(spreads, days)
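
In newer Dask releases the same pattern is, to our knowledge, written with `dask.config.set` and a scheduler name rather than `dask.set_options`.  A minimal sketch under that assumption follows; `double` is a hypothetical stand-in for real work.

```python
import dask


@dask.delayed
def double(x):
    # Hypothetical stand-in for real work.
    return 2 * x


values = [double(i) for i in range(5)]

# Assumption: a newer Dask release, where dask.config.set
# plays the role of dask.set_options.
with dask.config.set(scheduler="synchronous"):
    # Every compute call inside this block runs on the main thread,
    # without passing a scheduler argument explicitly.
    (results,) = dask.compute(values)

print(results)
```

The setting only applies inside the `with` block, so code outside it keeps using the default scheduler.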

### Question:  In what cases would you want to use one scheduler over another?

http://dask.pydata.org/en/latest/scheduler-choice.html

### Distributed Scheduler

The dask.distributed system is composed of a single centralized scheduler and several worker processes.  We can either set these up manually as command line processes or have Dask set them up for us from the notebook.

#### Manual

It is good to know how to set things up manually in case you want to try out Dask on a small cluster of your own.  However, if you are unfamiliar with the command line you can safely skip this section and go down to the Automatic section below. We run the `dask-scheduler` process on one machine.

    $ dask-scheduler
    distributed.scheduler - INFO -   Scheduler at:  tcp://127.0.0.1:8786
    distributed.bokeh.application - INFO - Web UI: http://127.0.0.1:8787/status/

The scheduler reports that it is running at 127.0.0.1 on port 8786.  The address 127.0.0.1 is another name for "localhost" or "this machine".  We will need to give this address to the workers and client so you might want to copy it now.

We now run the `dask-worker` process on every machine that we want to use for computation.  For now this is probably just once on our laptop, but in production this may be on many different machines:

    $ dask-worker 127.0.0.1:8786
    distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45011
    distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:8786

*Note*: For this tutorial we want to start the `dask-worker` process in the `dask-workshop/` directory.  This ensures that the workers' Python processes have access to the same data that our notebook process does.

    $ cd dask-workshop/  # navigate to wherever you have started your notebooks
    ~/dask-workshop/ $ dask-worker 127.0.0.1:8786

*Note*: By default the `dask-worker` command line tool starts a single process with a thread pool containing as many threads as your computer has cores.  If you are doing mostly GIL-releasing computations (NumPy, Pandas, scikit-learn) then this is the right choice.  However, if you are doing mostly GIL-bound computations (pure Python code, Pandas with text, parsing) then you will want to start the workers with multiple processes and one thread per process:

    $ dask-worker 127.0.0.1:8786 --nprocs 8 --nthreads 1
    
You can see more options by asking for help:

    $ dask-worker --help
    
When the scheduler and workers are running you can connect to them using a Dask `Client`, giving it the same address of the scheduler that you gave to the worker.

```python
from dask.distributed import Client
client = Client('127.0.0.1:8786')
```

#### Automatic

Starting a single scheduler and worker on the local machine is the common case, and using the command line for this can be annoying.  Dask will set everything up for you if you start a client with no arguments:

```python
from dask.distributed import Client
client = Client()
```

If you choose this approach then there is no need to set up a `dask-scheduler` or `dask-worker` process.

You can find more information at the following documentation pages:

- [Quickstart](http://distributed.readthedocs.io/en/latest/quickstart.html)
- [Setup Network](http://distributed.readthedocs.io/en/latest/setup.html)

In [None]:
# Exercise: Start a client 
#           that points to `'127.0.0.1:8786'` if you started a `dask-scheduler` manually 
#           or with no arguments (like `Client()`) if you want Dask to set things up for you.

from dask.distributed import Client



### Diagnostics

One of the main advantages of using the distributed scheduler is the diagnostics dashboards that should be hosted live at http://localhost:8787/status .  Visit that link and then run the computation again.

You may want to arrange your notebook and this webpage side-by-side on your screen so that you can see both at the same time.

In [None]:
%time s, d = dask.compute(spreads, days, get=client.get)  # This uses our "distributed" cluster

### Client takes over by default

Actually, we didn't need to add `get=client.get`.  Once a `Client` is instantiated, the distributed scheduler takes over as the default scheduler.

In [None]:
%time s, d = dask.compute(spreads, days)  # This used to use threads by default, now it uses dask.distributed

### New API

The distributed scheduler is more sophisticated than the single-machine schedulers.  It comes with more functions to manage data, compute in the background, and more.  The distributed scheduler also has entirely separate documentation:

-  http://distributed.readthedocs.io/en/latest/
-  http://distributed.readthedocs.io/en/latest/api.html
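
As a small taste of computing in the background, the sketch below submits work through the client's futures interface.  It is only an illustration under a few assumptions: `square` is a hypothetical function, `processes=False` keeps the scheduler and workers inside the current process, and `dashboard_address=None` skips the diagnostics server so that the example does not need a free port.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical stand-in for real work.
    return x ** 2


# Assumptions: processes=False runs the scheduler and workers in this process;
# dashboard_address=None disables the diagnostics web server.
with Client(processes=False, dashboard_address=None) as client:
    future = client.submit(square, 10)      # returns immediately with a future
    futures = client.map(square, range(5))  # one future per input
    single = future.result()                # block until the task finishes
    many = client.gather(futures)           # collect all results as a list

print(single, many)
```

`submit` and `map` return right away, so the notebook stays responsive while the workers compute; `result` and `gather` block only when the values are actually needed.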