<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Dask DataFrames

We finished the last section by building a parallel dataframe computation over a directory of CSV files using `dask.delayed`.  In this section we use `dask.dataframe` to build such computations for us in the common case of tabular data.  Dask dataframes look and feel like Pandas dataframes, but they run on the same infrastructure that powers `dask.delayed` (indeed, many `dask.dataframe` functions are built using `dask.delayed`).

In this notebook we use the same stock data as in notebook 01, but now rather than write for loops we let dask.dataframe construct our computations for us.  The `dask.dataframe.read_csv` function can take a globstring like `"data/stocks/GOOG/*.csv"` and build parallel computations on all of our data at once.

In [None]:
import os

import dask
import dask.dataframe as dd
%matplotlib inline

df = dd.read_csv(os.path.join('data', 'stocks', 'GOOG', '*.csv'), parse_dates=['timestamp'])
df

In [None]:
df.head()

In [None]:
df.tail()

We compute the maximum of the `high` column.  With Dask.delayed we could create this computation as follows:

```python
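import os
from glob import glob
import dask
import pandas as pd

filenames = sorted(glob(os.path.join('data', 'stocks', 'GOOG', '*.csv')))

maxes = []
for fn in filenames:
    part = dask.delayed(pd.read_csv)(fn)  # lazy read of one file
    maxes.append(part.high.max())         # lazy per-file maximum

final_max = dask.delayed(max)(maxes)      # lazy maximum of the per-file maxima
final_max.compute()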
```

Now we just use the normal Pandas syntax as follows:

In [None]:
%time df.high.max().compute()

This writes the dask.delayed computation for us and then runs it.  

Some things to note:

1.  As with dask.delayed, we need to call `.compute()` when we're done.  Up until this point everything is lazy.
2.  Dask will delete intermediate results (like the full pandas dataframe for each file) as soon as possible.
    -  This lets us handle datasets that are larger than memory
    -  This means that repeated computations have to load all of the data again each time.  Run the code above again: is it faster or slower than you would expect?

### Exercises

In this section we do a few trivial dask dataframe computations.  If you are comfortable with Pandas then these should be familiar.  You will have to think about when to call `compute`.

-  How many rows are in our dataset?
-  Read the first few rows of the timestamp column 
-  Use the `.dt.round` method to round the timestamp column to days
-  Get the high value for each day by grouping by the result from above and computing the maximum of the high column per group
-  Compute the daily high-low spread.  This is exactly the result from the final exercise in the dask.delayed notebook
-  Plot the resulting Pandas DataFrame
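One detail worth checking before the grouping step is how rounding treats times of day: rounding to days goes to the nearest midnight, so afternoon timestamps move to the following day, whereas flooring keeps them on the same calendar day. The sketch below uses plain pandas on two made-up timestamps; whether this matters for the exercise depends on the hours your data covers.

```python
import pandas as pd

# Hypothetical timestamps: one in the morning, one in the afternoon of the same day.
ts = pd.Series(pd.to_datetime(["2020-01-01 09:30", "2020-01-01 15:30"]))

rounded = ts.dt.round("D")  # nearest midnight: the afternoon value moves to 2020-01-02
floored = ts.dt.floor("D")  # start of the same day for both values

print(rounded.tolist())
print(floored.tolist())
```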

Let's do the same computation we did in the previous section, but now with a few lines of Pandas-ish code.

In [None]:
# Read the first few rows of the timestamp column


In [None]:
# Use the `.dt.round('1d')` method to round the timestamp column to days
# Show the first few rows to make sure it works well


In [None]:
# Get the high value for each day by grouping by the result from above 
# and computing the maximum of the high column per group



In [None]:
# Compute the daily high-low spread


In [None]:
# Plot the result
_.plot()

In [None]:
%load solutions/03-dataframe-spread.py