I believe that out-of-core merging will be easier if I use HDF streams to do the data processing. This notebook is an attempt to understand the file type and how it works in Pandas.

In [None]:
%pylab
import DataSTORM.processors as proc
import pandas as pd
from pathlib import Path

# Creating an HDF5 file
I don't currently have an HDF5 file, so I will create one from existing test data.

In [None]:
fileIn = Path('../test-data/Centrioles/FOV_7_noPB_1500mW_10ms_1/FOV_7_noPB_1500mW_10ms_1_MMStack_locResults_DC.dat')
with open(str(fileIn), 'r') as file:
    df = pd.read_csv(file)

In [None]:
df.describe()

To save a DataFrame as an HDF5 file, we use the DataFrame's to_hdf() method:

In [None]:
# Same folder and base name as the input, with the extension swapped to .h5
fileOut = fileIn.with_suffix('.h5')

In [None]:
df.to_hdf(str(fileOut),
          key    = 'localizations',
          format = 'table',
          mode   = 'w',
          data_columns = ['loglikelihood'])

Whether to_hdf() succeeds depends critically on the arguments passed to it. Here's a brief description of the above parameters:

1. **key = 'localizations'** This is the identifier for the table inside the HDF5 store.
2. **format = 'table'** This allows for searching the data from inside the store. The alternative and default argument is **'fixed'**, which is faster but not searchable.
3. **mode = 'w'** This creates a new file, overwriting any existing file of the same name. The default is **'a'** (append). to_hdf() threw some strange errors until I added this part.
4. **data_columns = ['loglikelihood']** This makes only this column searchable. That is, apart from the index, it is the only column we can use in the where clause of a select operation. Note that column headers containing units and spaces, such as 'x [nm]', currently cannot be used in queries.
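Since headers such as 'x [nm]' cannot be used in queries, one workaround is to rename the columns to identifier-like names before writing the store. The following is only a minimal sketch of that idea: the helper `to_identifier` and the small example DataFrame are hypothetical and not part of DataSTORM or Pandas.

```python
import re

import pandas as pd


def to_identifier(name: str) -> str:
    """Hypothetical helper: turn a header like 'x [nm]' into 'x_nm'."""
    # Collapse every run of non-word characters into a single underscore,
    # then drop any underscores left at the ends.
    cleaned = re.sub(r'\W+', '_', name).strip('_')
    # Identifiers cannot start with a digit, so prefix those names.
    return '_' + cleaned if cleaned[:1].isdigit() else cleaned


# Made-up data that mimics the localization file's headers.
example = pd.DataFrame({'x [nm]': [1.0, 2.0],
                        'y [nm]': [3.0, 4.0],
                        'loglikelihood': [100.0, 300.0]})

renamed = example.rename(columns=to_identifier)
print(list(renamed.columns))
```

The renamed columns could then be listed in data_columns when calling to_hdf(), at the cost of losing the units in the headers.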

# Reading HDF5 files
Let's start by simply obtaining the keys that identify the datasets.

In [None]:
hdf = pd.HDFStore(str(fileOut), mode = 'r')
for key in hdf.keys():
    print(key)

We have to close the hdf store when we are finished.

In [None]:
hdf.close()
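To avoid forgetting the close() call, an HDFStore can also be used as a context manager, which closes the store when the block exits. Here is a small sketch of this, assuming PyTables is installed; it writes made-up data to a temporary file rather than touching the real localization results.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for the localization table.
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                     'loglikelihood': [100.0, 300.0, 200.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.h5'
    demo.to_hdf(str(path), key='localizations', format='table', mode='w')

    # The store is closed automatically at the end of the with block.
    with pd.HDFStore(str(path), mode='r') as store:
        keys = store.keys()
        is_open_inside = store.is_open

    is_open_after = store.is_open

print(keys, is_open_inside, is_open_after)
```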

Next, let's see if I can read in specific columns from the store.

In [None]:
hdf = pd.HDFStore(str(fileOut), mode = 'r')
# pd.Term has been removed from recent pandas releases; pass the columns directly.
df2 = hdf.select(key     = 'localizations',
                 columns = ['x [nm]', 'y [nm]'])
hdf.close()

In [None]:
df2.describe()

I can also attempt to read in files using read_hdf().

In [None]:
df3 = pd.read_hdf(str(fileOut),
                  key = 'localizations',
                  columns = ['x [nm]', 'y [nm]'])

In [None]:
df3.describe()

I can also read from a store and filter the inputs at the same time.

In [None]:
hdf = pd.HDFStore(str(fileOut), mode = 'r')
# A query string replaces pd.Term, which recent pandas releases no longer provide.
df4 = hdf.select(key     = 'localizations',
                 columns = ['x [nm]', 'y [nm]', 'loglikelihood'],
                 where   = 'loglikelihood < 250.0')
hdf.close()

In [None]:
df4.describe()
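Since my goal is out-of-core processing, it is worth noting that select() can also return the filtered rows in chunks instead of one large DataFrame. Below is a minimal sketch of this, assuming PyTables is installed; the data, the chunk size of 4, and the running sum are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Made-up data: ten rows with loglikelihood values 0, 50, ..., 450.
demo = pd.DataFrame({'x': np.arange(10, dtype=float),
                     'loglikelihood': np.linspace(0.0, 450.0, 10)})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'demo.h5'
    demo.to_hdf(str(path), key='localizations', format='table',
                mode='w', data_columns=['loglikelihood'])

    chunk_sizes = []
    total = 0.0
    with pd.HDFStore(str(path), mode='r') as store:
        # With chunksize set, select() returns an iterator of DataFrames.
        # The store must stay open while the iterator is consumed.
        for chunk in store.select('localizations',
                                  where='loglikelihood < 250.0',
                                  chunksize=4):
            chunk_sizes.append(len(chunk))
            total += chunk['x'].sum()

print(sum(chunk_sizes), total)
```

Each chunk is an ordinary DataFrame, so per-chunk results can be accumulated without ever holding the whole table in memory.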

In [None]:
hdf = pd.HDFStore(str(fileOut), mode = 'r')

In [None]:
hdf.close()

In [None]:
import trackpy as tp

In [None]:
tp.PandasHDFStoreBig()