<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Dask DataFrames

In the last two sections we built computations with dask.delayed and then ran them on a distributed cluster using dask.distributed.  In this section we use dask.dataframe to build computations for us in the common case of tabular data.  Dask dataframes look and feel like Pandas dataframes, but they run on the same infrastructure that powers dask.delayed (indeed, many dask.dataframe functions are built using dask.delayed).

In this notebook we use the same stock data as in notebook 1, but now rather than write for loops we let dask.dataframe construct our computations for us.  The `dask.dataframe.read_csv` function can take a globstring like `"data/stocks/GOOG/*.csv"` and build parallel computations on all of our data at once.

In [None]:
import os

import dask
import dask.dataframe as dd
%matplotlib inline

df = dd.read_csv(os.path.join('data', 'stocks', 'GOOG', '*.csv'), parse_dates=['timestamp'])
df

In [None]:
%time df.head()

In [None]:
%time df.high.max().compute()

### Connect to the distributed scheduler

To use the diagnostic dashboard while working through the exercises in this section, create a Dask Client and connect it to your local cluster.

In [None]:
from dask.distributed import Client

# client = Client('127.0.0.1:8786')  # if you are running dask-scheduler
# client = Client()  # if you want Dask to set things up for you

### Exercises

In this section we do a few simple Dask dataframe computations.  If you know Pandas then these should feel familiar.  You will have to think about when to call `compute`.

-  How many rows are in our dataset?
-  Read the first few rows of the timestamp column 
-  Use the `.dt.round` method to round the timestamp column to days
-  Get the high value for each day by grouping by the result from above and computing the maximum of the high column per group
-  Compute the daily high-low spread.  This is exactly the result from the final exercise in the dask.delayed notebook
-  Plot the resulting Pandas DataFrame

Let's do the same computation we did in the previous section, but now with a few lines of Pandas-ish code.

In [None]:
# Read the first few rows of the timestamp column


In [None]:
# Use the `.dt.round('1d')` method to round the timestamp column to days
# Show the first few rows to make sure it works well


In [None]:
# Get the high value for each day by grouping by the result from above 
# and computing the maximum of the high column per group



In [None]:
# Compute the daily high-low spread


In [None]:
# Plot the result
_.plot()

In [None]:
%load solutions/03-dataframe-spread.py

### Persist data in distributed memory

Every time we run an operation like `df.high.max().compute()` we read through our dataset from disk.  This can be slow, especially because we're reading data from CSV.  We usually have two options to make this faster:

1.  Persist relevant data in memory, either on our computer or on a cluster
2.  Use a faster on-disk format, like HDF5 or Parquet

In this section we persist our data in memory.  On a single machine this is often done by pre-processing and reducing the data with Dask dataframe, then calling `compute` to get a Pandas dataframe and using Pandas from then on.

```python
df = dd.read_csv(...)
df = df[df.account == 1234]  # filter down to smaller dataset
pdf = df.compute()  # convert to pandas
pdf ... # continue with familiar Pandas workflows
```

However, on a distributed cluster even our cleaned data may be too large for Pandas.  In this case we ask Dask to persist data in memory with the `persist` method or the `dask.persist` function.  This is what we'll do today.  It will also help us understand when data is lazy and when it is computing.

You can trigger computations using the persist method:

    x = x.persist()

or the dask.persist function for multiple inputs:

    x, y = dask.persist(x, y)

### Exercise

Persist the dataframe into memory

In [None]:
# Persist df


In [None]:
# Time computing len after waiting for persist to finish.  
# How much faster is it?



### Exercise

Copy-paste the Daily High-Low Spread plot from above.  How much faster is it?  What is taking all of the time?

In [None]:
%%time
# copy-paste code from Daily High-Low Spread plot above

### Partitions

A Dask dataframe is composed of several Pandas dataframes.  The organization of these dataframes can significantly impact performance.  In this section we discuss two factors that commonly impact performance:

1.  The number of Pandas dataframes can affect overhead.  If the dataframes are too small then Dask might spend more time deciding what to do than Pandas spends actually doing it.  Ideally, each computation should take hundreds of milliseconds.

2.  If we know how the dataframes are sorted then certain operations become much faster.

### Number of partitions and repartitioning

When we read in our data from CSV files we got one Pandas dataframe for each day.  Look at the metadata below to determine how many partitions we have.  Each "partition" is a Pandas dataframe.

In [None]:
df

**Question:** Roughly how large is each partition?

There are a few ways to answer this:

1.  Look at the diagnostic dashboard to see how much memory is being used.  Divide this by the number of partitions.
2.  Use the [.map_partitions()](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method along with the Pandas expression `memory_usage().sum()` to determine how many bytes each partition consumes.

We see that the partitions in our dataframe are somewhat small.  This is because the data for every day isn't very large.  This means that Dask may spend more time scheduling computations than Pandas actually spends running them.  We would like to partition our data so that our individual Pandas dataframes are roughly 100 MB each.

### Reduce the number of partitions with repartition

We can bring partitions together with the [.repartition](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) method.  Be sure to persist the dataframe afterwards so that we don't do the repartition step over and over again.  About 20 partitions is probably a good number.

### Compare timings

Use the diagnostic dashboard and the `%time` magic to compare the speed of some operations that we did above.  How have things improved?

### Sorted Index column

*This section doesn't have any exercises.  Just follow along.*

Many dataframe operations like loc-indexing, groupby-apply, and joins are *much* faster on a sorted index.  For example, if we want to get the data for a particular day it *really* helps to know where that day is; otherwise we need to search over all of our data.

The Pandas model gives us a sorted index column.  Dask.dataframe copies this model, and it remembers the min and max values of every partition's index.

By default, our data doesn't have an index.

In [None]:
df.head()

So if we search for a particular day it takes a while because it has to pass through all of the data.

In [None]:
%time df[df.timestamp.dt.round('1d') == '2015-05-05'].compute()

However, if we set the timestamp column as the index then this operation can be much faster.

In [None]:
%%time
df = df.set_index('timestamp')
df

In [None]:
%time df.loc['2015-05-05'].compute()

Additionally, this lets us use traditional Pandas time series functionality.

In [None]:
%%time 
(df.close
   .resample('1d')
   .mean()
   .ffill()
   .compute()
   .plot(figsize=(10, 5)))