# Lazy Loading from CSV
When a large, indexed dataframe is saved to CSV in chunks using dask, the index information (the divisions) is lost. Even if every individual chunk is sorted by index, that is not enough to read the files back and get an indexed dataframe with known divisions. That's unfortunate.

Solution: persist the divisions alongside the file chunks and pass them to set_index after read_csv.

In [1]:
# setup
import os,sys
import numpy as np
import pandas as pd
from ast import literal_eval

# add workdir to path and import scripts
sys.path.append(os.environ['HOME']+'/github')

# visual
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline
plt.xkcd()

# out-of-core and parallel data management
import dask.dataframe as dd
import dask
dask.config.set({'temporary_directory': '/tmp'})
dask.config.set(scheduler='processes')
from dask.diagnostics import ProgressBar
ProgressBar().register()

%load_ext autoreload
%autoreload 2

In [2]:
# create a pandas dataframe
_df= pd.DataFrame({'id':[i for i in range(100000)],
                   'value':[np.random.randint(1000) for i in range(100000)]})

# turn to dask
_dd = dd.from_pandas(_df,npartitions=4)

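# set index; set_index returns a new dataframe, so keep the result
# ('id' is already sorted, so sorted=True avoids a shuffle)
_dd = _dd.set_index('id',sorted=True)

# save to csv, one file per partition; the index is written so the 'id' column survives
_dd.to_csv('./dask_partitions_example_*.csv')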

# save divisions
with open('./divisions.index','w') as _file:
    _file.write('{}'.format(_dd.divisions))

[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.2s
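
Writing the divisions with `'{}'.format(...)` and reading them back with `literal_eval` works as long as the divisions are plain Python values. If they happen to be NumPy scalars, newer NumPy versions print them as `np.int64(0)`, which `literal_eval` rejects. Below is a minimal sketch of a more defensive round trip through JSON. The helper names `save_divisions` and `load_divisions` are hypothetical, and the sketch assumes numeric or string divisions (datetime divisions would need their own conversion).

```python
import json


def _to_builtin(value):
    # NumPy scalars expose .item(), which returns the equivalent Python value;
    # plain Python values are passed through unchanged.
    return value.item() if hasattr(value, "item") else value


def save_divisions(path, divisions):
    # Hypothetical helper: store the divisions as a JSON list.
    with open(path, "w") as f:
        json.dump([_to_builtin(d) for d in divisions], f)


def load_divisions(path):
    # Hypothetical helper: return the divisions as a tuple, the shape dask uses.
    with open(path) as f:
        return tuple(json.load(f))
```

With these helpers, `save_divisions('./divisions.json', _dd.divisions)` would stand in for the `open`/`write` step above, and `load_divisions('./divisions.json')` for the `literal_eval` step when reading back.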


In [3]:
# load back from CSV and set index again (because it was lost in CSV) - no optimization
_dd2 = dd.read_csv('./dask_partitions_example_*.csv')
%timeit _dd2.set_index('id',sorted=True)

[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
...

In [4]:
# load back from CSV - using divisions

# read divisions
with open('divisions.index','r') as _file:
    _div = literal_eval(_file.read())

_dd2 = dd.read_csv('./dask_partitions_example_*.csv')
%timeit _dd2.set_index('id',sorted=True,divisions=_div)

941 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
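
Why the divisions matter: they are a tuple of `npartitions + 1` index values. Partition `i` holds the index values from `divisions[i]` up to, but not including, `divisions[i+1]`, except for the last partition, which also includes its upper bound. With that tuple known, finding the partition that holds a given index value is a binary search over a handful of numbers instead of a scan of the data. The sketch below illustrates the idea in plain Python; `partition_for` is a hypothetical helper, not dask's internal code.

```python
from bisect import bisect_right


def partition_for(key, divisions):
    """Return the number of the partition whose range contains `key`."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # bisect_right counts the boundaries <= key; subtracting 1 gives the partition.
    # The last boundary is inclusive, so clamp to the last partition number.
    return min(bisect_right(divisions, key) - 1, len(divisions) - 2)


# Divisions of a 4-partition dataframe indexed by id 0..99999.
divisions = (0, 25000, 50000, 75000, 99999)
first = partition_for(0, divisions)       # partition 0
middle = partition_for(25000, divisions)  # partition 1
last = partition_for(99999, divisions)    # partition 3
```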


### Conclusion: save divisions!