# Processing data in batch from a datastore
A major advantage of the automatic organization of data in B-Store is that batch processing becomes very simple. Batch processing is the automated processing and analysis of selected data from the datastore.

A batch process typically goes as follows:
1. Define a pipeline consisting of processors that perform operations on the data
2. Define a batch processor, telling it where your datastore is located and which pipeline to apply
3. Run the process and output the data files to a directory on your computer
4. Analyze the results

In [1]:
# Import the essential bstore libraries
from bstore import processors, batch

# pathlib is part of the Python standard library (3.4 and later), not B-Store
from pathlib import Path

## Before starting: Get the test data
Once again, we will use data inside B-Store's test file repository in this example. Clone or download the repository from https://github.com/kmdouglass/bstore_test_files and point the variable below to *test_experiment/test_experiment_db.h5* within the *bstore_test_files* folder.

In [2]:
dbFile = Path('../../bstore_test_files/test_experiment/test_experiment_db.h5')

# Step one: Define the processing pipeline
A batch processor works by opening each dataset in a B-Store datastore and applying a `processor` to it in sequence. A processor represents one fundamental processing step that can be performed on a localization dataset.

In this example, we'll discard poorly localized molecules by keeping only the rows whose `uncertainty` values are less than 20 and whose `loglikelihood` values are 250 or less. The pipeline will consist of a Python list of these two `Filter` processors. They are applied in list order, from the first element to the last.

## Current list of processors
At the time of this writing, B-Store provides these built-in processors:

1. **AddColumn** - Adds a single column to a DataFrame and fills it with a default value
2. **CleanUp** - Removes rows containing invalid entries, such as `Inf` or `NaN`
3. **Cluster** - Performs spatial clustering on localizations
4. **ComputeClusterStats** - Computes features of clustered localizations
5. **ConvertHeader** - Changes the names of the columns
6. **FiducialDriftCorrect** - Interactively finds fiducial beads and uses them to perform drift correction
7. **Filter** - Filters out rows not matching the filter criteria
8. **Merge** - Merges localizations that are nearby in time and space

In [3]:
# Set up the processors
# uncertainty and loglikelihood are column names
Filter1 = processors.Filter('uncertainty',   '<',   20) # Note the quotation marks ''
Filter2 = processors.Filter('loglikelihood', '<=', 250)

# Create the pipeline; [...] denotes a Python list
pipeline = [
            Filter1,
            Filter2
           ]
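
To make the effect of this pipeline concrete, here is a minimal plain-Python sketch of what two such filters do to a table of rows. It is not B-Store's implementation: `make_filter`, `OPERATORS`, and the example rows are hypothetical stand-ins that only mimic the column, operator, and value arguments used above.

```python
import operator

# Map comparison strings to Python functions.
# This table is an illustrative assumption, not B-Store's internal code.
OPERATORS = {'<': operator.lt, '<=': operator.le,
             '>': operator.gt, '>=': operator.ge,
             '==': operator.eq, '!=': operator.ne}

def make_filter(column, op, value):
    """Return a function that keeps only rows where `row[column] op value` holds."""
    compare = OPERATORS[op]
    def apply(rows):
        return [row for row in rows if compare(row[column], value)]
    return apply

# Made-up localizations, for illustration only
rows = [
    {'uncertainty': 12.0, 'loglikelihood': 180.0},
    {'uncertainty': 25.0, 'loglikelihood': 100.0},  # fails the first filter
    {'uncertainty': 15.0, 'loglikelihood': 300.0},  # fails the second filter
]

pipeline_sketch = [make_filter('uncertainty', '<', 20),
                   make_filter('loglikelihood', '<=', 250)]

# Apply the steps in list order, first to last
for step in pipeline_sketch:
    rows = step(rows)

print(rows)
```

Only the first row survives, because it is the only one that satisfies both conditions.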

# Step two: Set up the batch processor
With the pipeline defined, we can now set up the batch processor. Since our datastore is inside an HDF file, we'll use B-Store's `HDFBatchProcessor` to read the data and apply the pipeline.

When creating the `HDFBatchProcessor`, we need to supply two arguments:

1. `dbFile` - Our B-Store HDF datastore file
2. `pipeline` - The list of processors to apply to the data

The optional arguments to `HDFBatchProcessor` are:

1. `outputDirectory` - The full path to a directory to output the results. By default, this is a folder in the same directory as the calling code and is called *processed_data*. If the `outputDirectory` does not exist, it will be automatically created.
2. `searchString` - A string that matches one of B-Store's dataset types and identifies the type of data in the datastore to process. By default this is `Localizations`.

In [4]:
bp = batch.HDFBatchProcessor(dbFile, pipeline)

When initialized, the batch processor opens the datastore and locates all datasets whose type matches `searchString`. We can inspect the datasets it found through its `datasetList` field.

In [5]:
for ds in bp.datasetList:
    print(ds)

DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='Localizations', attributeOf=None, channelID='A647', dateID=None, posID=(0,), sliceID=None)
DatasetID(prefix='HeLaS_Control', acqID=2, datasetType='Localizations', attributeOf=None, channelID='A647', dateID=None, posID=(0,), sliceID=None)


# Step three: Run the batch processor
Now that the batch processor is set up, we use the `go()` method to automatically apply our pipeline to each dataset.

In [6]:
bp.go()

Output directory does not exist. Creating it...
Created folder /home/kmdouglass/src/bstore/examples/processed_data


# Step four: Analyze the results
The batch processor has output its results into the *processed_data* directory. Here's what this contains:

In [7]:
# %ls is an IPython magic that calls the system's ls command;
# -R lists all files and folders recursively
%ls -R processed_data/

processed_data/:
HeLaL_Control/  HeLaS_Control/

processed_data/HeLaL_Control:
HeLaL_Control_1/

processed_data/HeLaL_Control/HeLaL_Control_1:
Localizations_ChannelA647_Pos0.csv  Localizations_ChannelA647_Pos0.json

processed_data/HeLaS_Control:
HeLaS_Control_2/

processed_data/HeLaS_Control/HeLaS_Control_2:
Localizations_ChannelA647_Pos0.csv  Localizations_ChannelA647_Pos0.json


The output shows that a folder was generated for each dataset prefix in the datastore, in this case `HeLaL_Control` and `HeLaS_Control`. Inside each of these folders, another folder was generated for each specific acquisition, here `HeLaL_Control_1` and `HeLaS_Control_2`. Finally, these folders contain two files each. The processed localizations are in .csv files and can be opened in any software package that reads comma-separated values, like [ThunderSTORM](https://github.com/zitmen/thunderstorm) or even Microsoft Excel. Each .csv file has a corresponding .json file that contains the B-Store dataset IDs, ensuring that each processed dataset can be traced back to its original dataset in the datastore.
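
As a minimal sketch of how such a pair of files could be read back with Python's standard library, the snippet below writes a tiny stand-in dataset to a temporary folder and then loads the .csv rows together with the .json file next to it. The column names, the numbers, the comma delimiter, and the JSON keys are assumptions made for illustration; check them against your own output files.

```python
import csv
import json
import tempfile
from pathlib import Path

def load_result(csv_path):
    """Return (rows, dataset_id) for one processed .csv file and its .json companion."""
    with open(csv_path, newline='') as f:
        rows = list(csv.DictReader(f))
    with open(csv_path.with_suffix('.json')) as f:
        dataset_id = json.load(f)
    return rows, dataset_id

# Build a tiny stand-in for one acquisition folder so the sketch runs anywhere.
# The numbers and the JSON layout are made up for illustration.
folder = Path(tempfile.mkdtemp()) / 'processed_data' / 'HeLaL_Control' / 'HeLaL_Control_1'
folder.mkdir(parents=True)
csv_file = folder / 'Localizations_ChannelA647_Pos0.csv'
csv_file.write_text('x,y,uncertainty\n10.0,20.0,12.5\n11.0,21.0,8.0\n')
csv_file.with_suffix('.json').write_text(
    json.dumps({'prefix': 'HeLaL_Control', 'acqID': 1}))

rows, dataset_id = load_result(csv_file)
print(dataset_id['prefix'], dataset_id['acqID'], len(rows))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need to be converted before any analysis.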

# Extra note: batch processing with Python set operations

As noted in [Tutorial 1](https://github.com/kmdouglass/bstore/blob/master/examples/Tutorial%201%20-%20Introduction%20to%20B-Store%20databases.ipynb), you can iterate over Datastores using standard Python set operations. This means that the `HDFBatchProcessor` is not strictly necessary. It is provided as a convenience to those not familiar with Python.

If, however, you are familiar with functional programming styles in Python, you can chain processors together, since the output of one can be passed as an argument to the next. These processors can be applied to each dataset by iterating over the Datastore and saving the results if desired.
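
The chaining idea can be sketched with `functools.reduce`. This assumes that each processor is a callable that takes a dataset and returns the processed dataset; the two functions and the dictionary of datasets below are hypothetical stand-ins, not B-Store objects.

```python
from functools import reduce

def apply_pipeline(data, steps):
    """Feed `data` through each step in order; each output becomes the next input."""
    return reduce(lambda result, step: step(result), steps, data)

# Hypothetical stand-ins for processors: callables that take and return a dataset
def drop_negative(values):
    return [v for v in values if v >= 0]

def double(values):
    return [2 * v for v in values]

# Hypothetical stand-in for iterating over the datasets in a datastore
datasets = {
    'HeLaL_Control_1': [12, -1, 18],
    'HeLaS_Control_2': [-3, 7],
}

results = {name: apply_pipeline(values, [drop_negative, double])
           for name, values in datasets.items()}
print(results)
```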

# Extra note: batch processing on CSV files
In addition to the `HDFBatchProcessor`, B-Store provides a `CSVBatchProcessor` for performing a batch process on .csv files containing localization data. This is useful if you have already pulled data out of a datastore and processed it once, but want to perform additional steps on the processed data.

The `CSVBatchProcessor` works in much the same way as the `HDFBatchProcessor`. Instead of searching a datastore, it searches a directory and its sub-directories for all files matching the pattern in its `suffix` argument. If `useSameFolder = False`, the processed files are placed in the directory given by the `outputDirectory` argument. If `useSameFolder = True`, the newly processed files are placed in the same folder as the originals.
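
The search and the same-folder naming can be sketched with `pathlib`. This is only an illustration under stated assumptions, not the `CSVBatchProcessor` source: `find_inputs` and `same_folder_output` are hypothetical helpers, and the `_processed` naming is inferred from the file listings in this tutorial.

```python
import tempfile
from pathlib import Path

def find_inputs(input_directory, suffix):
    """Return every file below input_directory whose name ends with suffix."""
    return sorted(p for p in Path(input_directory).rglob('*' + suffix) if p.is_file())

def same_folder_output(input_file):
    """Place the output next to the input, with '_processed' appended to the stem."""
    return input_file.with_name(input_file.stem + '_processed' + input_file.suffix)

# Recreate the folder layout from step four with empty files
root = Path(tempfile.mkdtemp()) / 'processed_data'
for prefix, acq in [('HeLaL_Control', 1), ('HeLaS_Control', 2)]:
    folder = root / prefix / '{}_{}'.format(prefix, acq)
    folder.mkdir(parents=True)
    (folder / 'Localizations_ChannelA647_Pos0.csv').touch()
    (folder / 'Localizations_ChannelA647_Pos0.json').touch()

inputs = find_inputs(root, '.csv')
outputs = [same_folder_output(p) for p in inputs]
for p in outputs:
    print(p.relative_to(root))
```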

In [8]:
# Define an additional filter to apply to the
# already processed results
newPipeline = [processors.Filter('sigma', '<', 175)]

# Search for all .csv files in the processed_data
# directory
inputDirectory = Path('processed_data/')
suffix         = '.csv'

# Create the CSV batch processor
bpCSV = batch.CSVBatchProcessor(inputDirectory, newPipeline,
                                useSameFolder = True, suffix = suffix)

# Run the CSV batch processor
bpCSV.go()

In [9]:
# Display the new contents of processed_data
%ls -R processed_data/

processed_data/:
HeLaL_Control/  HeLaS_Control/

processed_data/HeLaL_Control:
HeLaL_Control_1/

processed_data/HeLaL_Control/HeLaL_Control_1:
Localizations_ChannelA647_Pos0.csv
Localizations_ChannelA647_Pos0.json
Localizations_ChannelA647_Pos0_processed.csv

processed_data/HeLaS_Control:
HeLaS_Control_2/

processed_data/HeLaS_Control/HeLaS_Control_2:
Localizations_ChannelA647_Pos0.csv
Localizations_ChannelA647_Pos0.json
Localizations_ChannelA647_Pos0_processed.csv


The listing now shows two additional files, each ending in *_processed.csv*, that contain the newly processed localizations.

# Summary

1. Localization data inside a B-Store HDF datastore can be automatically processed using an `HDFBatchProcessor`.
2. A batch process consists of a list of processors (known as a pipeline) that are sequentially applied to each dataset in the datastore.
3. B-Store ships with several built-in processors for performing common computations on localization data.
4. A batch processor automatically applies the pipeline to the data and saves the results in a structured output directory for further analysis. To do this, call the `go()` method.
5. B-Store supplies a `CSVBatchProcessor` for automatically processing .csv files in multiple directories.

In [10]:
# Clean up the example files
import shutil
shutil.rmtree('processed_data/')