# Dask DataFrames

In the last two sections we built computations with dask.delayed and then ran them on a distributed cluster using dask.distributed. In this section we use dask.dataframe to build computations for us in the common case of tabular data. Dask dataframes look and feel like Pandas dataframes, but they run on the same infrastructure that powers dask.delayed (indeed, parts of dask.dataframe are built directly with dask.delayed).

In this notebook we use the same stock data as in notebook 1, but now, rather than writing for loops ourselves, we let dask.dataframe construct the computations for us.

In [None]:
import os

import dask
import dask.dataframe as dd

df = dd.read_csv(os.path.join('data', 'stocks', 'GOOG', '*.csv'), parse_dates=['timestamp'])
df

In [None]:
%time df.head()

In [None]:
%time df.high.max().compute()

In [None]:
%%time
%matplotlib inline

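# Bucket rows by calendar day; floor keeps afternoon rows on the same day,
# whereas round would move them to the following midnight
day = df.timestamp.dt.floor('1D')
high = df.groupby(day).high.max()
low = df.groupby(day).low.min()
spread = high - low  # daily high-low spread
spread.compute().plot(figsize=(10, 5))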
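
When bucketing intraday timestamps by day, it is worth checking whether rounding or flooring is the intended behaviour. The sketch below uses plain Pandas and three hypothetical timestamps from a single trading day to show how the two differ.

```python
import pandas as pd

# Hypothetical intraday timestamps, all on the same trading day
ts = pd.Series(pd.to_datetime([
    "2010-01-04 09:30:00",
    "2010-01-04 11:59:00",
    "2010-01-04 15:59:00",
]))

# round goes to the nearest midnight, so the afternoon row lands on the next day
rounded = ts.dt.round("1D")

# floor always goes back to midnight of the same calendar day
floored = ts.dt.floor("1D")

n_round = rounded.nunique()
n_floor = floored.nunique()
```

Here the rounded series contains two distinct days while the floored series contains one, so a groupby on the rounded values would split a single trading day across two groups.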

### Persist

In [None]:
from dask.distributed import Client, progress
c = Client('localhost:8786')
c

In [None]:
df = df.persist()
progress(df)

In [None]:
len(df)

In [None]:
df = df.set_index('timestamp', sorted=True).persist()

In [None]:
%time df.close.resample('1D').mean().ffill().compute().plot(figsize=(10, 5))
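
To see what the resample and forward-fill steps do, here is a small Pandas-only sketch with hypothetical prices on a Friday and the following Monday. Days without data come out as NaN after resampling, and the forward fill carries the last known value across the gap.

```python
import pandas as pd

# Hypothetical closing prices: two on Friday, one on the following Monday
close = pd.Series(
    [10.0, 12.0, 20.0],
    index=pd.to_datetime([
        "2010-01-08 10:00",
        "2010-01-08 14:00",
        "2010-01-11 10:00",
    ]),
)

# Daily mean; Saturday and Sunday have no rows, so they become NaN
daily = close.resample("1D").mean()

# Forward fill carries Friday's mean across the weekend
filled = daily.ffill()
```

Without the fill, a line plot of the daily series would show breaks wherever a day has no trades.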

In [None]:
pdf = df.compute()

In [None]:
%%time
pdf.close.resample('1D').mean().ffill().plot(figsize=(10, 5))