<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Schedulers and Efficiency

In the previous two notebooks we used Dask.delayed and Dask.dataframe to create computations.  By default, these ran on a local thread pool on our personal machines.  Often, this is sufficient, especially when you are bound by NumPy and Pandas routines which release the GIL and when you are using powerful workstation computers with many cores.

However, sometimes you may want to execute your code in processes (for pure Python code that holds onto the GIL), in a single thread (for profiling and debugging), or across a cluster (for larger computations).

In this section we first talk about changing schedulers.  Then we use the dask.distributed scheduler in more depth.  Finally we redo some of our dataframe computations, but with an eye to efficiency.

### Previous exercise

One solution to the previous exercise follows below.  We run this same computation using three single-machine schedulers:

1.  `dask.threaded.get         # uses a local threadpool`
2.  `dask.multiprocessing.get  # uses a local process pool`
3.  `dask.async.get_sync       # uses only the main thread (useful for debugging)`

In each case we change the scheduler by providing a `get=` keyword argument like the following:

```python
total.compute(get=dask.multiprocessing.get)
# or 
dask.compute(a, b, get=dask.multiprocessing.get)
```
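The `get=` keyword matches the Dask release this tutorial was written for. Later Dask releases select the scheduler by name through a `scheduler=` keyword instead. A minimal sketch, assuming a recent Dask install; the `inc` task and the toy graph are hypothetical stand-ins for a real computation:

```python
import dask
from dask import delayed

@delayed
def inc(x):
    # Hypothetical toy task standing in for real work
    return x + 1

# Build a small lazy graph: sum of inc(0) .. inc(4)
total = delayed(sum)([inc(i) for i in range(5)])

# Newer Dask releases name the scheduler instead of passing a get function
result_threads = total.compute(scheduler="threads")      # local thread pool
result_sync = total.compute(scheduler="synchronous")     # main thread only
# scheduler="processes" would select the local process pool in the same way
```

Both calls return the same value; only where the tasks run changes.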

In [None]:
import dask 
import dask.multiprocessing
import dask.dataframe as dd
import pandas as pd
from glob import glob
import os

In [None]:
%%time
df = dd.read_csv(os.path.join('data', 'stocks', 'GOOG', '*.csv'), parse_dates=['timestamp'])

high = df.groupby(df.timestamp.dt.round('1d')).high.max()
low = df.groupby(df.timestamp.dt.round('1d')).low.min()
spread = high - low

In [None]:
%time _ = spread.compute()  # this uses threads by default

In [None]:
%time _ = spread.compute(get=dask.multiprocessing.get)  # this uses processes

In [None]:
%time _ = spread.compute(get=dask.async.get_sync)  # this uses only the main thread

### Profiling

*You should skip this section if you are running low on time*.

The synchronous scheduler is particularly valuable for debugging and profiling.  

For example, the IPython `%prun` magic reports which functions take up the most time in our computation.  Try this magic on the computation above with each of the schedulers.  How informative is this magic when running parallel code?

In [None]:
%prun _ = spread.compute(get=dask.threaded.get)

In [None]:
%prun _ = spread.compute(get=dask.multiprocessing.get)

In [None]:
%prun _ = spread.compute(get=dask.async.get_sync)

### Question:  In what cases would you want to use one scheduler over another?

http://dask.pydata.org/en/latest/scheduler-choice.html

### Distributed Scheduler

The dask.distributed system is composed of a single centralized scheduler and several worker processes.  We can either set these up manually as command line processes or have Dask set them up for us from the notebook.  


#### Automatic

Starting a single scheduler and worker on the local machine is the common case, and using the command line for this can be annoying.  Dask will set everything up for you if you start a client with no arguments:

```python
from dask.distributed import Client
client = Client()
```

If you choose this approach then there is no need to set up a `dask-scheduler` or `dask-worker` process as described below.


#### Manual

It is good to know how to set things up manually in case you want to try out Dask on a small cluster of your own.  However, if you are unfamiliar with the command line you can safely skip this section and use the Automatic approach above.

We run the `dask-scheduler` process on one machine:

    $ dask-scheduler
    distributed.scheduler - INFO -   Scheduler at:  tcp://127.0.0.1:8786
    distributed.bokeh.application - INFO - Web UI: http://127.0.0.1:8787/status/

The scheduler reports that it is running at 127.0.0.1 on port 8786.  The address 127.0.0.1 is another name for "localhost" or "this machine".  We will need to give this address to the workers and client so you might want to copy it now.

We now run the `dask-worker` process on every machine that we want to use for computation.  For right now this is probably just once on our laptop, but in production this may be on many different machines:

    $ dask-worker 127.0.0.1:8786
    distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:45011
    distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:8786

*Note*: for this tutorial we want to start the dask-worker process in the `dask-workshop/` directory.  This ensures that the workers' Python processes have access to the same data that our notebook process does.

    $ cd dask-workshop/  # navigate to wherever you have started your notebooks
    ~/dask-workshop/ $ dask-worker 127.0.0.1:8786

*Note*: By default the dask-worker command line tool starts a single process with a thread pool with as many threads as you have cores on your computer.  If you are doing mostly GIL-released computations (numpy, pandas, scikit-learn) then this is the right choice.  However, if you are doing mostly GIL-bound computations (Python code, pandas with text, parsing) then you will want to start the workers with multiple processes and one thread per process:

    $ dask-worker 127.0.0.1:8786 --nprocs 8 --nthreads 1
    
You can see more options by asking for help:

    $ dask-worker --help
    
When the scheduler and workers are running you can connect to them using a Dask `Client`, giving it the same address of the scheduler that you gave to the worker.

```python
from dask.distributed import Client
client = Client('127.0.0.1:8786')
```
### More information

You can find more information at the following documentation pages:

- [Quickstart](http://distributed.readthedocs.io/en/latest/quickstart.html)
- [Setup Network](http://distributed.readthedocs.io/en/latest/setup.html)

In [None]:
# Exercise: Start a client 
#           that points to `'127.0.0.1:8786'` if you started a `dask-scheduler` manually 
#           or with no arguments (like `Client()`) if you want Dask to set things up for you.

from dask.distributed import Client



### Diagnostics

One of the main advantages of using the distributed scheduler is the diagnostic dashboard, which should be hosted live at http://localhost:8787/status.  Visit that link and then run the computation again.

You may want to arrange your notebook and this webpage side-by-side on your screen so that you can see both at the same time.

In [None]:
%time _ = spread.compute(get=client.get)  # This uses our "distributed" cluster

### Client takes over by default

Actually, we didn't need to add `get=client.get`.  The distributed scheduler becomes the default scheduler as soon as the Client is instantiated.

In [None]:
%time _ = spread.compute()  # This used to use threads by default, now it uses dask.distributed

### New API

The distributed scheduler is more sophisticated than the single-machine schedulers.  It comes with more functions to manage data, compute in the background, and more.  The distributed scheduler also has entirely separate documentation:

-  http://distributed.readthedocs.io/en/latest/
-  http://distributed.readthedocs.io/en/latest/api.html

## Efficiency

In this section we combine the distributed scheduler with our dask.dataframe exercises to learn about how to make our computations more efficient.  We will cover the following topics:

1.  Persist common intermediate results in memory with `persist`
2.  Reduce per-task overhead by repartitioning our dataframes

### Persist data in distributed memory

Every time we run an operation like `df.high.max().compute()` we read through our dataset from disk.  This can be slow, especially because we're reading data from CSV.  We usually have two options to make this faster:

1.  Persist relevant data in memory, either on our computer or on a cluster
2.  Use a faster on-disk format, like HDF5 or Parquet

In this section we persist our data in memory.  On a single machine this is often done by doing a bit of pre-processing and data reduction with dask dataframe and then `compute`-ing to a Pandas dataframe and using Pandas in the future.  

```python
df = dd.read_csv(...)
df = df[df.account == 1234]  # filter down to smaller dataset
pdf = df.compute()  # convert to pandas
pdf ... # continue with familiar Pandas workflows
```

However, on a distributed cluster, when even our cleaned data is too large for a single machine's memory, we still can't use Pandas.  In this case we ask Dask to persist data in memory with the `dask.persist` function.  This is what we'll do today.  It will also help us understand when data is lazy and when it is computing.

You can trigger computations using the persist method:

    x = x.persist()

or the dask.persist function for multiple inputs:

    x, y = dask.persist(x, y)

### Exercise

Persist the dataframe into memory.

-  After it has persisted how long does it take to compute `df.high.max()`?
-  Looking at the plots in the [diagnostic web page](http://localhost:8787/status), what is taking up most of the time?  (You can hover over rectangles to see what function they represent)

In [None]:
%time df.high.max().compute()

In [None]:
df = # TODO: persist dataframe in memory

### Exercise

Copy-paste the daily high-low spread computation from above and run it again.  How much faster is it?  What is taking all of the time?

### Partitions

One Dask.dataframe is composed of several Pandas dataframes.  The organization of these dataframes can significantly impact performance.  In this section we discuss two factors that commonly impact performance:

1.  The number of Pandas dataframes can affect overhead.  If the dataframes are too small then Dask might spend more time deciding what to do than Pandas spends actually doing it.  Ideally the computation on each partition should take hundreds of milliseconds.

2.  If we know how the dataframes are sorted then certain operations become much faster.

### Number of partitions and repartitioning

When we read in our data from CSV files we got one Pandas dataframe for each day.  Look at the metadata below to determine how many partitions we have.  Each "partition" is a Pandas dataframe.

In [None]:
df

**Question:** Roughly how large is each partition?

There are a few ways to answer this:

1.  Look at the diagnostic dashboard to see how much memory is being used.  Divide this by the number of partitions.
2.  Use the [.map_partitions()](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method along with the `pandas.DataFrame.memory_usage().sum()` function to determine how many bytes each partition consumes.

We see that the partitions in our dataframe are somewhat small.  This is because the data for every day isn't very large.  This means that Dask may spend more time scheduling computations than Pandas actually spends running them.  We would like to partition our data so that our individual Pandas dataframes are roughly 100MB each.
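The target size translates into a partition count with a little arithmetic. In this sketch the helper name `suggest_npartitions` and the 2 GB total are hypothetical; plug in the total size you measured for your own data:

```python
import math

def suggest_npartitions(total_bytes, target_partition_bytes=100 * 10**6):
    """Number of partitions needed so each one is at most the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# A hypothetical 2 GB dataset split into ~100MB pieces
npartitions = suggest_npartitions(2 * 10**9)
print(npartitions)  # 20
```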

### Reduce the number of partitions with repartition

We can bring partitions together with the [.repartition](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) method.  Be sure to persist the dataframe afterwards so that we don't do the repartition step over and over again.  About 20 partitions is probably a good number.

### Compare timings

Use the diagnostic dashboard and the `%time` magic to compare the speed of some operations that we did above.  How have things improved?

### Sorted Index column

*This section doesn't have any exercises.  Just follow along.*

Many dataframe operations like loc-indexing, groupby-apply, and joins are *much* faster on a sorted index.  For example, if we want to get the data for a particular day it *really* helps to know where that day is; otherwise we need to search over all of our data.

The Pandas model gives us a sorted index column.  Dask.dataframe copies this model, and it remembers the min and max values of every partition's index.
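Why the per-partition min and max values help can be sketched without Dask at all: given sorted partition boundaries, a binary search finds the single partition that can contain a given day. This is a conceptual illustration and not Dask's actual implementation; the boundary dates and the `partition_for` helper are hypothetical (Dask exposes comparable boundaries as `df.divisions`):

```python
from bisect import bisect_right

# Hypothetical boundaries: partition i holds index values in
# [divisions[i], divisions[i + 1]); the last partition also includes its upper bound
divisions = ["2015-01-01", "2015-04-01", "2015-07-01", "2015-10-01", "2015-12-31"]

def partition_for(key, divisions):
    """Return the position of the one partition whose range contains key."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # bisect_right returns the position of the first boundary greater than key
    i = bisect_right(divisions, key) - 1
    # The final boundary belongs to the last partition
    return min(i, len(divisions) - 2)

print(partition_for("2015-05-05", divisions))  # 1
```

With four partitions, a lookup for one day reads one partition instead of all four. ISO-formatted date strings are used here because they sort in chronological order.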

By default, our data doesn't have a meaningful index, only the default integer index.

In [None]:
df.head()

So if we search for a particular day it takes a while because it has to pass through all of the data.

In [None]:
%time df[df.timestamp.dt.round('1d') == '2015-05-05'].compute()

However, if we set the timestamp column as the index then this operation can be much faster.

In [None]:
%%time
df = df.set_index('timestamp')
df

In [None]:
%time df.loc['2015-05-05'].compute()

Additionally, a datetime index lets us use traditional Pandas timeseries functionality.

In [None]:
%%time 
(df.close
   .resample('1d')
   .mean()
   .fillna(method='ffill')
   .compute()
   .plot(figsize=(10, 5)))