<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">


# Delayed DataFrames

Two of the previous notebooks showed two ways to build parallel computations with Dask:

1.  Use Dask.delayed to wrap custom code
2.  Use Dask.dataframe to handle large dataframes 

Most non-trivial problems require both.  We often deal with tabular data, but messy situations arise where we need to handle things manually.

In this notebook we use Dask.delayed to load some custom data and then convert these delayed values into a Dask dataframe.  This shows us how to use both together.

In [None]:
import glob
import os
filenames = sorted(glob.glob(os.path.join('data', 'stocks', '*', '*.csv')))

In [None]:
dirname = os.path.join('data', 'messy')
if os.path.exists(dirname):
    import shutil
    shutil.rmtree(dirname)
    
os.mkdir(dirname)

In [None]:
import pandas as pd
import feather

In [None]:
df = pd.read_csv('data/stocks/GOOG/2015-01-10.csv', parse_dates=['timestamp'])

In [None]:
# Seconds elapsed since midnight, as integers
(df.timestamp - df.timestamp.dt.floor('1D')).dt.total_seconds().astype(int)
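
To make the expression above concrete, here is a minimal sketch on a tiny synthetic series. The timestamps and the date are made up for illustration and do not come from the stock files. It shows the conversion to seconds since midnight and how adding the date back recovers the original values.

```python
import pandas as pd

# Hypothetical timestamps standing in for the 'timestamp' column of one CSV file.
ts = pd.Series(pd.to_datetime(['2015-01-10 09:00:00',
                               '2015-01-10 09:00:05',
                               '2015-01-10 16:30:00']))

# Subtracting midnight of the same day leaves a timedelta ...
since_midnight = ts - ts.dt.floor('1D')
# ... which total_seconds() turns into a plain number of seconds.
seconds = since_midnight.dt.total_seconds().astype('int64')

# Adding the date back recovers the original timestamps.
restored = pd.to_timedelta(seconds, unit='s') + pd.Timestamp('2015-01-10')

print(seconds.tolist())
print((restored == ts).all())
```

Storing only the seconds drops the date from the column, which is why the date has to be recovered from the directory name later on.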

In [None]:
def convert(fn):
    # Input paths look like data/stocks/<symbol>/<date>.csv
    data, _, symbol, date = fn.split(os.sep)
    date = date.split('.')[0]
    df = pd.read_csv(fn, parse_dates=['timestamp'])
    # Keep only the time of day, as integer seconds since midnight
    df['timestamp'] = (df.timestamp - df.timestamp.dt.floor('1D')).dt.total_seconds().astype(int)
    new_fn = os.path.join(data, 'messy', date, symbol + '.feather')
    # exist_ok avoids a race when several processes create the same directory
    os.makedirs(os.path.dirname(new_fn), exist_ok=True)
    feather.write_dataframe(df, new_fn)

import dask
values = [dask.delayed(convert)(fn) for fn in filenames]

# Recent Dask versions select the scheduler with `scheduler=` rather than `get=`
dask.compute(values, scheduler='processes');
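
The path handling inside `convert` is easy to get wrong, so here is a small sketch of just that step, with no file I/O. The helper name `messy_path` is hypothetical and is not part of the notebook. It assumes input paths of the form `data/stocks/<symbol>/<date>.csv`.

```python
import os

def messy_path(fn):
    """Map data/stocks/<symbol>/<date>.csv to data/messy/<date>/<symbol>.feather."""
    data, _, symbol, date = fn.split(os.sep)
    date = os.path.splitext(date)[0]  # drop the '.csv' extension
    return os.path.join(data, 'messy', date, symbol + '.feather')

src = os.path.join('data', 'stocks', 'GOOG', '2015-01-10.csv')
print(messy_path(src))
```

The symbol and date swap places: the source tree is grouped by symbol, while the "messy" tree is grouped by date.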

### Inspect data

```
data/messy/2015-01-01
├── AAPL.feather
├── GOOG.feather
├── MSFT.feather
└── YHOO.feather
data/messy/2015-01-02
├── AAPL.feather
├── GOOG.feather
├── MSFT.feather
└── YHOO.feather
data/messy/2015-01-03
├── AAPL.feather
├── GOOG.feather
├── MSFT.feather
└── YHOO.feather
```

In [None]:
import feather
df = feather.read_dataframe(os.path.join('data', 'messy', '2015-01-01', 'GOOG.feather'))

In [None]:
df.head()

### Load sequentially, concat to pandas dataframe

In the code below we:

1.  Load each dataframe into memory
2.  Alter the dataframe to include the symbol name and date taken from the filename
3.  Concatenate them into a large dataframe

We will eventually want to parallelize this computation using dask.delayed and dask.dataframe.

In [None]:
dfs = []
for directory in sorted(glob.glob(os.path.join('data', 'messy', '*'))):
    for fn in sorted(glob.glob(os.path.join(directory, '*'))):
        _, _, date, symbol = fn.split(os.path.sep)
        symbol = symbol[:-len('.feather')]
        date = pd.Timestamp(date)
        df = feather.read_dataframe(fn)
        df['timestamp'] = df.timestamp.astype('m8[s]') + date
        df['symbol'] = symbol
        dfs.append(df)

In [None]:
df = pd.concat(dfs, axis=0)
df.head()
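
One detail of `pd.concat` worth knowing here: with `axis=0` each input keeps its own index, so labels repeat in the result. The sketch below uses two small made-up frames (the column names are illustrative only) and shows `ignore_index=True` as one way to get a fresh index, should that be wanted.

```python
import pandas as pd

# Two small hypothetical frames standing in for per-file dataframes.
a = pd.DataFrame({'close': [1.0, 2.0], 'symbol': 'AAPL'})
b = pd.DataFrame({'close': [3.0, 4.0], 'symbol': 'GOOG'})

stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())      # index labels repeat across inputs

renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # fresh 0..n-1 index
```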

### Use dask.dataframe.from_delayed

We can construct a Dask dataframe from many delayed values, each of which produces a pandas dataframe when computed.  Each delayed value forms one partition of the final dataframe.

Consider the following example:

In [None]:
def f(n):
    return pd.DataFrame({'x': list(range(n)),
                         'y': [i ** 2 for i in range(n)]})

f(5)

In [None]:
lazy_dataframes = [dask.delayed(f)(n) for n in [1, 3, 5, 7]]
lazy_dataframes

In [None]:
import dask.dataframe as dd
df = dd.from_delayed(lazy_dataframes)
df

In [None]:
df.compute()

### Exercise: Delayed + Dataframes

Build a lazy Dask dataframe from the sequential dataframe munging code we had above.  You will have to use dask.delayed to parallelize/lazify the for-loop code from before and then use `dd.from_delayed` to convert these many lazy Pandas dataframes into a dask.dataframe.

*Hint: You may at some point need to rely on [pandas.DataFrame.assign](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.assign.html) to avoid mutating a delayed object*
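
As a reminder of what `assign` does, here is a small sketch on an ordinary pandas dataframe with made-up values. It is not the exercise solution. It only shows that `assign` returns a new dataframe and leaves the original untouched, which is the property that matters when the input is a delayed object.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': [0, 5, 10]})   # seconds since midnight
date = pd.Timestamp('2015-01-01')              # hypothetical date

# assign returns a new DataFrame instead of modifying df in place.
out = df.assign(timestamp=pd.to_timedelta(df.timestamp, unit='s') + date,
                symbol='GOOG')

print(list(df.columns))    # original frame is unchanged
print(list(out.columns))   # new frame has the extra column
```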

In [None]:
# Convert this code using dask.delayed

dfs = []
for directory in sorted(glob.glob(os.path.join('data', 'messy', '*'))):
    for fn in sorted(glob.glob(os.path.join(directory, '*'))):
        _, _, date, symbol = fn.split(os.path.sep)
        symbol = symbol[:-len('.feather')]
        date = pd.Timestamp(date)
        df = feather.read_dataframe(fn)
        df['timestamp'] = df.timestamp.astype('m8[s]') + date
        df['symbol'] = symbol
        dfs.append(df)

In [None]:
# Convert delayed values to dask.dataframe


In [None]:
%load solutions/04-delayed-dataframes.py