In addition to the normal BatchProcessor class, there is a class called H5BatchProcessor for working with the HDF5 file format. This format allows tabular data to be written to disk quickly and efficiently. More importantly, database-like queries may be made against the file so that data is selectively read back into memory. These features, along with HDF's popularity in data-intensive science, make the HDF format a useful one for storing very large localization files.

The H5BatchProcessor can take either a .csv file (like the .dat files that the Fang software outputs) or .h5 file as input. It produces an .h5 file as output. Inside the .h5 file, multiple tables may be stored, so that filtered and merged localizations may be kept in the same file as the raw localizations.

The Python Pandas library provides extremely fast functions for writing and reading these files; the DataSTORM library is essentially a wrapper around Pandas to make the Pandas code easier to use.
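As a rough illustration of what happens underneath, the sketch below uses plain Pandas (not DataSTORM) to store two tables in one .h5 file and then read back only the rows that match a query. The file name, table keys, and values are made up for the example, and it assumes the PyTables package is installed.

```python
import os
import tempfile

import pandas as pd

# Made-up localizations; the column names mimic those used later in this tutorial
locs = pd.DataFrame({'x':       [10.0, 20.0, 30.0, 40.0],
                     'y':       [5.0, 15.0, 25.0, 35.0],
                     'frame':   [1, 1, 2, 3],
                     'photons': [500.0, 3000.0, 1200.0, 4500.0]})

h5Path = os.path.join(tempfile.mkdtemp(), 'example.h5')

# format='table' with data_columns=True makes the columns searchable on disk
locs.to_hdf(h5Path, key='raw', format='table', data_columns=True)
locs[locs['photons'] > 1000].to_hdf(h5Path, key='filtered',
                                    format='table', data_columns=True)

# Only the rows satisfying the query are loaded into memory
earlyLocs = pd.read_hdf(h5Path, key='raw', where='frame < 3')

# Both tables live side by side in the same file
with pd.HDFStore(h5Path, mode='r') as store:
    tableKeys = sorted(store.keys())
```

Here `earlyLocs` holds only the localizations from frames 1 and 2, and `tableKeys` lists both tables stored in the file.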

In [1]:
%pylab
import DataSTORM.processors as proc
import DataSTORM.batch      as bat
import pandas as pd
from pathlib import Path

Using matplotlib backend: Qt4Agg
Populating the interactive namespace from numpy and matplotlib


# Overview of the H5BatchProcessor
The H5BatchProcessor works in a manner similar to BatchProcessor but with a few extra features. It allows the user to set a chunk size so that only a small amount of data is read and processed at a time. If the datasets are small, then the whole dataset may be loaded and processed in memory by setting `chunksize = None`.

Pipelines are constructed in a manner similar to Tutorial 2. The one difference is that only Filter and CleanUp are supported when the `chunksize` is something other than None. The reason for this is that out-of-core processing requires special algorithms for processing data on disk.
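To see why row-wise operations such as filtering work well with chunks, consider the following minimal sketch in plain Pandas. It is not DataSTORM's implementation; the data, the chunk size, and the photon threshold are invented for the example.

```python
import io

import pandas as pd

# Made-up localization data standing in for a large .csv file on disk
csvText = 'x,y,frame,photons\n' + '\n'.join(
    f'{i * 10.0},{i * 5.0},{i},{100 * i}' for i in range(1, 11))

filteredChunks = []
# With chunksize set, read_csv yields DataFrames of at most 4 rows each
for chunk in pd.read_csv(io.StringIO(csvText), chunksize=4):
    # A row-wise filter needs no information from any other chunk
    filteredChunks.append(chunk[chunk['photons'] > 500])

filtered = pd.concat(filteredChunks, ignore_index=True)
```

In a real out-of-core workflow each filtered chunk would be appended to a file on disk rather than collected in a list. Operations that must compare rows across chunks, such as merging, cannot be handled this simply.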

H5BatchProcessor provides one additional method called `goMerge()`. This method is used to merge datasets that are too large to fit in memory. The downside is that merging localization data from the disk takes an extremely long time; for this reason, it should only be used in extreme cases. Whenever the data fits in memory, merging may be performed by the H5BatchProcessor or the regular BatchProcessor by placing it in the pipeline as usual.

# Clean up data from a .csv file and save to .h5

In [2]:
cleanup  = proc.CleanUp()
pipeline = [cleanup]

In [3]:
inputDir = Path('../test-data/Centrioles/')
bp = bat.H5BatchProcessor(inputDir,
                          pipeline,
                          suffix        = '_locResults_small.dat', # Look for files ending with these
                          chunksize     = 2e6,         # Number of localizations in a chunk; set to None to load all localizations into memory
                          inputFileType = 'csv',       # Can be either 'csv' or 'h5'
                          useSameFolder = True,        # Save results to the same folder as the input datafiles
                          outputKey     = 'processed') # Optional: this identifies the table inside the h5 file

# Run the pipeline on the data
Since the pipeline is just a CleanUp processor, the data will be cleaned up and stored in an h5 file.

In [4]:
bp.go()

# Perform out-of-core merging
This step is only necessary if the data is too large to fit into memory on your machine. It continually reads from and writes to the h5 file on the disk. Because it takes a long time, it is recommended to do merging in the pipeline of the `BatchProcessor.go()` method if possible.

In [5]:
%time bp.goMerge(mergeRadius = 40,
                 tOff        = 1,
                 writeChunks = 10000) # This determines how many trajectories to compute statistics for

Frame 889: 28 trajectories present
CPU times: user 3min 33s, sys: 7.62 s, total: 3min 41s
Wall time: 3min 40s
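To give a feel for what `mergeRadius` and `tOff` control, here is a simplified, in-memory sketch of merging written in pure Python. It is not the algorithm DataSTORM uses. It assumes that a localization joins an existing track when it lies within `mergeRadius` of the track's last position and no more than `tOff` frames were skipped in between. The function name and the data are hypothetical.

```python
import math

def mergeLocalizations(locs, mergeRadius, tOff):
    """Group (x, y, frame) tuples into tracks and average their positions.

    Assumption: a localization joins a track if it lies within mergeRadius
    of the track's last position and at most tOff frames were skipped.
    """
    tracks = []  # each track is a list of (x, y, frame) tuples
    for loc in sorted(locs, key=lambda item: item[2]):  # process in frame order
        x, y, frame = loc
        best, bestDist = None, mergeRadius
        for track in tracks:
            lastX, lastY, lastFrame = track[-1]
            gap = frame - lastFrame - 1  # number of frames skipped
            dist = math.hypot(x - lastX, y - lastY)
            # Keep the closest track that satisfies both constraints
            if 0 <= gap <= tOff and dist <= bestDist:
                best, bestDist = track, dist
        if best is None:
            tracks.append([loc])  # start a new track
        else:
            best.append(loc)
    # Summarize each track as (mean x, mean y, first frame, length)
    return [(sum(p[0] for p in t) / len(t),
             sum(p[1] for p in t) / len(t),
             t[0][2],
             len(t)) for t in tracks]

# Made-up localizations: (x, y, frame)
locs = [(100, 100, 1), (105, 102, 2), (103, 99, 4), (500, 500, 2), (101, 100, 7)]
merged = mergeLocalizations(locs, mergeRadius=40, tOff=1)
```

With these inputs the first three nearby localizations form one track of length 3, while the distant localization and the one appearing after a long gap each start their own track.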


# Verify the processed and merged data
The processed but unmerged data is retained in a table whose name is set by `outputKey` (`processed` in this example). The merged data is stored in a table named `merged`.

In [6]:
file     = Path('../test-data/Centrioles/FOV_1_noPB_1500mW_10ms_1/FOV_1_noPB_1500mW_10ms_1_MMStack_locResults_small_processed.h5')

procedData = pd.read_hdf(str(file), key = 'processed') # Read the processed--but unmerged--localizations
mergedData = pd.read_hdf(str(file), key = 'merged')    # Read the merged data from the same file

In [7]:
procedData.describe()

| | x | y | z | frame | precision | photons | bg | loglikelihood | sigma |
|---|---|---|---|---|---|---|---|---|---|
| count | 53747.0 | 53747.0 | 53747 | 53747.0 | 53747.0 | 53747.0 | 53747.0 | 53747.0 | 53747.0 |
| mean | 27697.949764 | 31415.551339 | 0 | 497.493386 | 198.025954 | 4453.983217 | 235.304148 | 242.865845 | 138.299916 |
| std | 17617.999474 | 18810.862273 | 0 | 236.054761 | 5349.492832 | 2776.706977 | 61.389436 | 403.084408 | 17.146917 |
| min | 130.56 | 6.2479 | 0 | 100.0 | 1.144 | 1.0 | 147.37 | -15.715 | 87.197 |
| 25% | 10539.0 | 13739.5 | 0 | 296.0 | 3.7804 | 2465.4 | 211.59 | 86.1405 | 129.01 |
| 50% | 26753.0 | 29162.0 | 0 | 488.0 | 5.148 | 3501.1 | 226.98 | 115.95 | 135.73 |
| 75% | 40249.5 | 48693.0 | 0 | 693.0 | 6.6572 | 5539.55 | 240.87 | 190.65 | 143.88 |
| max | 64881.0 | 64961.0 | 0 | 940.0 | 172430.0 | 39686.0 | 953.39 | 9072.8 | 378.0 |


In [8]:
mergedData.describe()

| | x | y | z | loglikelihood | bg | photons | frame | length |
|---|---|---|---|---|---|---|---|---|
| count | 13157.0 | 13157.0 | 13157 | 13157.0 | 13157.0 | 13157.0 | 13157.0 | 13157.0 |
| mean | 27351.94989 | 28669.883848 | 0 | 346.202615 | 961.214737 | 18194.293583 | 490.56525 | 4.084974 |
| std | 16960.97495 | 18237.781763 | 0 | 510.652373 | 2848.105135 | 61815.535802 | 237.541356 | 13.140657 |
| min | 133.361549 | 8.103771 | 0 | 18.378 | 152.12 | 1.0 | 100.0 | 1.0 |
| 25% | 10565.852533 | 9995.613906 | 0 | 91.717 | 258.93 | 5073.8 | 283.0 | 1.0 |
| 50% | 26224.0 | 26895.223398 | 0 | 126.46 | 501.23 | 9573.8 | 482.0 | 2.0 |
| 75% | 39531.0 | 40788.864662 | 0 | 376.123333 | 1111.32 | 20367.2 | 682.0 | 5.0 |
| max | 64881.0 | 64961.0 | 0 | 8829.866667 | 180336.79 | 5152339.2 | 939.0 | 840.0 |


You will notice that the column names have changed. The reason for this is that a table in an HDF file can only be queried from disk if its column names do not contain spaces.
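A renaming of this kind can be sketched in plain Pandas. The ThunderSTORM-style column names and the mapping below are illustrative assumptions, not the exact conversion table that DataSTORM applies.

```python
import pandas as pd

# Hypothetical ThunderSTORM-style header containing spaces and units
df = pd.DataFrame({'x [nm]': [10.0, 20.0],
                   'y [nm]': [5.0, 15.0],
                   'intensity [photon]': [500.0, 3000.0]})

# Illustrative mapping to space-free names
nameMap = {'x [nm]': 'x', 'y [nm]': 'y', 'intensity [photon]': 'photons'}
renamed = df.rename(columns=nameMap)

# Sanity check: no column name contains a space any more
hasSpaces = any(' ' in name for name in renamed.columns)
```

After the rename, every column can be referred to by a simple identifier in an on-disk query.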

# Save the merged data to a .csv
It may be convenient to save the merged data back to a csv so one may, for example, render the data in ThunderSTORM. To do this, we can simply convert the header back to ThunderSTORM format and save the DataFrame as a csv file.

In [9]:
convert = proc.ConvertHeader(proc.FormatLEB(), proc.FormatThunderSTORM())
newDF   = convert(mergedData)

# Save to the same directory as this notebook
newDF.to_csv('mergedData.csv', index = False) # `index = False` means that the linked particle ID will not be saved