# Building a simple datastore from a small experiment
In this first tutorial we'll take a small experiment, which includes raw localizations, widefield images, and metadata, and build its data into a database. The purpose of doing so is to provide a compact, well-organized representation of single-molecule localization microscopy (SMLM) data that facilitates high-content analysis and reproducibility.

The datastore will exist within an [HDF](https://www.hdfgroup.org/) file. The organization of the data inside the file will be handled by B-Store.

In [1]:
# Import the essential bstore libraries
from bstore import database, parsers

# This is part of Python 3.4 and greater and not part of B-Store
from pathlib import Path

## Before starting: Get the test data
You can get the test data for this tutorial from the B-Store test repository at https://github.com/kmdouglass/bstore_test_files. Clone or download the files and change the path below to point to the folder *parsers_test_files/SimpleParser* within this repository.

In [2]:
searchDirectory = Path('../../bstore_test_files/parsers_test_files/SimpleParser/') # ../ means go up one directory level

*SimpleParser* contains a few folders holding data from an imaging experiment performed on HeLa cells using STORM. The raw STORM localizations are in the files matching the pattern \*.csv. Metadata for the localizations are in JSON format and stored in files matching the \*.txt pattern. Before each STORM image, widefield images were captured and saved in TIFF format. The naming pattern for these files is \*.tif.
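
Before building anything, it can help to check how many files of each kind sit below the search directory. The following is a minimal sketch using only the standard library. `count_by_suffix` is a hypothetical helper written for this tutorial, not part of B-Store, and it assumes the three extensions named above.

```python
from collections import Counter
from pathlib import Path


def count_by_suffix(directory, suffixes=('.csv', '.txt', '.tif')):
    """Count files below `directory` (recursively) for each suffix of interest.

    Hypothetical helper for illustration; B-Store does not provide it.
    """
    counts = Counter()
    for path in Path(directory).rglob('*'):
        # Compare suffixes case-insensitively so '.TIF' counts as '.tif'
        if path.is_file() and path.suffix.lower() in suffixes:
            counts[path.suffix.lower()] += 1
    return dict(counts)
```

With the `searchDirectory` defined in the cell above, `count_by_suffix(searchDirectory)` returns a dict that maps each extension to the number of matching files.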

# Step one: Create a parser to read the datasets
In this step, we'll create a parser that can read the files that are stored inside the test data directory and convert them into a format that's more suitable for automated organization and retrieval. One of the default parsers that comes with B-Store and that we'll use in this tutorial is called `SimpleParser`. This parser transforms filenames of the format *prefix_acqID.fileExtension* into DatasetIDs. *prefix* is a descriptive name given to a dataset, such as *HeLa_Cells* or *treatment* and *acqID* is an integer uniquely identifying the field of view.
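
To make the naming convention concrete, here is a small sketch of how a filename of the form *prefix_acqID.fileExtension* can be split into its two parts. It is not `SimpleParser`'s actual implementation. The function `split_simple_filename` and the example filename are assumptions made for illustration, and the sketch assumes the acqID is the integer after the last underscore.

```python
from pathlib import Path


def split_simple_filename(filename):
    """Split 'prefix_acqID.ext' into (prefix, acqID).

    Illustration only: assumes the acqID is the integer that follows
    the last underscore in the filename.
    """
    stem = Path(filename).stem                 # drop the file extension
    prefix, _, acq_id = stem.rpartition('_')   # split at the LAST underscore
    return prefix, int(acq_id)


# Hypothetical filename following the prefix_acqID.fileExtension pattern
print(split_simple_filename('HeLaL_Control_1.csv'))  # ('HeLaL_Control', 1)
```

Splitting at the last underscore lets the prefix itself contain underscores, as in *HeLa_Cells*.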

Because every lab acquires and computes localizations differently, you can use a more customizable parser known as the `PositionParser`, or even write your own in Python code and store it in B-Store's plugins directory: `~/.bstore/bsplugins`. (Note that on Windows `~` becomes `%USERPROFILE%`).

In [3]:
# Create the parser
parser = parsers.SimpleParser()

And that's it! Of course, this step is easy if a parser already exists for your data.

# Step two: Create the empty datastore object
The `HDFDatastore` object is what B-Store uses to build a datastore inside an HDF file. All `HDFDatastore` objects are essentially sets of `DatasetID`s, with some additional functionality to make it easy to get and put data from and into the datastore. The `Parser`'s job is to assign unique `DatasetID`s to your files based on their naming pattern.

This type of design feature, where data and metadata are structured in a certain way as they pass into and out of a database, is known as an interface. The advantage is that you can structure your data however you want on either side of the interface so long as it can be translated into the right format. The format is defined by the `DatasetID` object named `HDFDatastore.dsID`.

When we create the datastore, we specify the path to the file where it will be stored. Note that no file is created until data is actually put into it.

In [4]:
# The path is relative to this notebook.
# Alternatively, you could pass a Path object
# instead of a string to the HDFDatastore constructor.
dsName = 'myFirstDatastore.h5'
myDS   = database.HDFDatastore(dsName)

# Step three: Run a test build of the datastore
Now comes the fun part. We build the datastore by using the HDFDatastore's `build()` method. To do this, we need to send a few required arguments to the method. These are:

1. `parser` - The parser we created to interpret the data files
2. `searchDirectory` - The parent directory containing files and subdirectories with all the experimental data

We also need to specify each type of data we want to include. First we register the types of datasets we want to work with like this:

In [5]:
import bstore.config
bstore.config.__Registered_DatasetTypes__ = ['Localizations', 'LocMetadata', 'WidefieldImage']

`Localizations`, `LocMetadata`, and `WidefieldImage` are three built-in types of datasets. A list of dataset types and their code may be found here: https://github.com/kmdouglass/bstore/tree/master/bstore/datasetTypes

Once the dataset types are registered, we need to tell the build process what files correspond to what dataset types. To do this we will pass a dict called filenameStrings to the `build()` method.

```
filenameStrings = {'Localizations' :  '.csv',
                   'LocMetadata'   :  '.txt',
                   'WidefieldImage' : '.tif'}
```

In this example, localizations are saved in .csv files. If there are special naming patterns to your files, you can use wildcards to identify your files. For example, if your localization files follow the pattern **prefix**\_locs\_**acqID**.csv, then you can pass locs\*.csv instead of .csv above to better specify the files.
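
The effect of a more specific pattern can be sketched with the standard library's `fnmatch` module. This sketch assumes that each filename string is treated as a glob matched against the end of the filename. That is an assumption about the matching behavior rather than something taken from B-Store's source, and the filenames and the `matching` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames following the prefix_locs_acqID pattern
files = ['HeLa_locs_1.csv', 'HeLa_drift_1.csv', 'HeLa_locs_1.txt']


def matching(files, filename_string):
    """Return the filenames whose tail matches `filename_string` (glob syntax).

    Assumption: the string is matched against the end of each filename.
    """
    return [f for f in files if fnmatchcase(f, '*' + filename_string)]


print(matching(files, '.csv'))       # ['HeLa_locs_1.csv', 'HeLa_drift_1.csv']
print(matching(files, 'locs*.csv'))  # ['HeLa_locs_1.csv']
```

The bare extension picks up every .csv file, while the wildcard pattern excludes .csv files that do not hold localizations.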

Finally, there is a boolean argument named `dryRun`. If you set this to True, the build method won't actually create the database. It will however return a structure that tells you what datasets were successfully parsed and capable of insertion into the database. By default, `dryRun` is set to False.

Let's go ahead and set `build()`'s arguments and do a dry run of the build.

In [6]:
# With dryRun = True the files are parsed and reported,
# but nothing is written to the datastore
myDS.build(parser, searchDirectory,
           filenameStrings = {'Localizations' :  '.csv',
                              'LocMetadata'   :  '.txt',
                              'WidefieldImage' : '.tif'},
           dryRun = True)

6 files were successfully parsed.


prefix,acqID,datasetType,attributeOf,channelID,dateID,posID,sliceID
HeLaL_Control,1,WidefieldImage,,,,,
HeLaL_Control,1,Localizations,,,,,
HeLaL_Control,1,LocMetadata,Localizations,,,,
HeLaS_Control,2,WidefieldImage,,,,,
HeLaS_Control,2,Localizations,,,,,
HeLaS_Control,2,LocMetadata,Localizations,,,,


The above table contains all the datasets that the `HDFDatastore` found in the `searchDirectory` and is sorted by the acquisition's prefix and ID number. Let's go through these results to understand what they are telling us.

## prefix and acqID
The `prefix` is the descriptive name given to a dataset. In this example, it contains the cell type (for example, HeLaS) and the condition (Control). The prefix can be anything you want and is required for insertion into the database. The table tells us that two different prefixes, `HeLaL_Control` and `HeLaS_Control`, were imaged, with one acquisition each.

The `acqID` number is an integer that identifies an acquisition and is also required. An acquisition is simply a collection of datasets containing, for example, localizations, metadata, and possibly widefield images of a single field of view. The set of all acquisitions with the same `prefix` forms an acquisition group.

You can also see that the acqID need not start at one, since the first acqID in the HeLaS_Control group is 2.
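
The grouping described in this section can be sketched in plain Python. The triples below are copied from the dry-run table above. The nested dictionary is only an illustration of the concept and makes no claim about how B-Store lays out data inside the HDF file.

```python
from collections import defaultdict

# (prefix, acqID, datasetType) triples copied from the dry-run table
datasets = [
    ('HeLaL_Control', 1, 'WidefieldImage'),
    ('HeLaL_Control', 1, 'Localizations'),
    ('HeLaL_Control', 1, 'LocMetadata'),
    ('HeLaS_Control', 2, 'WidefieldImage'),
    ('HeLaS_Control', 2, 'Localizations'),
    ('HeLaS_Control', 2, 'LocMetadata'),
]

# Acquisition group (prefix) -> acquisition (acqID) -> dataset types
groups = defaultdict(lambda: defaultdict(list))
for prefix, acq_id, dataset_type in datasets:
    groups[prefix][acq_id].append(dataset_type)

for prefix, acquisitions in groups.items():
    print(prefix, dict(acquisitions))
```

Each prefix maps to one acquisition, and each acquisition holds three datasets.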

## datasetType
The `datasetType` is also a required ID. The `datasetType` tells the datastore what type of data it is looking at during the build operation so that it knows how to store it.

Currently, `datasetType` supports the following options:

1. Localizations - Tabulated localization data in a raw text format (can be comma separated, tab-separated, etc.)
2. LocMetadata - Textual metadata describing the localizations (currently only JSON is supported)
3. WidefieldImage - A single widefield image of the field of view (.tif and .OME.TIFF are supported)
4. FiducialTracks - Tabulated raw text data on localizations from fiducial markers
5. AverageFiducial - An average over many fiducial tracks, also in tabulated form

## attributeOf

Datasets that describe other datasets have an `attributeOf` field. Because `LocMetadata` describes `Localizations`, you can see that `Localizations` is listed in the corresponding entry.

## channelID, dateID, posID, and sliceID
These fields are optional and specify the fluorescence channel, date of the acquisition, position, and axial slice of a field of view, respectively. They serve to more precisely identify datasets in complex acquisitions.

The channel can be any string you want, such as `A647`.

The dateID is given in a format like YYYY-MM-DD.

The position ID usually follows a format like `(0,)`, a one-element tuple holding a single integer that identifies the position corresponding to this dataset. This allows the user to specify different positions on a sample that were imaged within the same acquisition. It can also take the form of a two-element tuple like `(x,y)` if desired.

The slice ID is simply an integer.
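
As a sketch of what a fully specified ID might look like, the snippet below fills in all four optional fields. The `DatasetID` defined here is a stand-in `namedtuple` written for illustration, not B-Store's own class, and the channel, date, position, and slice values are hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Stand-in for illustration only; B-Store provides its own DatasetID type.
DatasetID = namedtuple('DatasetID', ['prefix', 'acqID', 'datasetType',
                                     'attributeOf', 'channelID', 'dateID',
                                     'posID', 'sliceID'])

# Hypothetical values for the four optional fields
example = DatasetID(prefix='HeLaL_Control', acqID=1,
                    datasetType='Localizations', attributeOf=None,
                    channelID='A647', dateID='2016-06-30',
                    posID=(0,), sliceID=3)

# Check that the dateID follows the YYYY-MM-DD format; raises ValueError if not
parsed = datetime.strptime(example.dateID, '%Y-%m-%d')

print(example.channelID, example.dateID, example.posID, example.sliceID)
```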

# Perform the real database build
Now that we've verified everything going into the database, we can build it by setting `dryRun` to False.

In [7]:
myDS.build(parser, searchDirectory,
           filenameStrings = {'Localizations' :  '.csv',
                              'LocMetadata'   :  '.txt',
                              'WidefieldImage' : '.tif'},
           dryRun = False)

6 files were successfully parsed.


prefix,acqID,datasetType,attributeOf,channelID,dateID,posID,sliceID
HeLaL_Control,1,WidefieldImage,,,,,
HeLaL_Control,1,Localizations,,,,,
HeLaL_Control,1,LocMetadata,Localizations,,,,
HeLaS_Control,2,WidefieldImage,,,,,
HeLaS_Control,2,Localizations,,,,,
HeLaS_Control,2,LocMetadata,Localizations,,,,


Let's verify that the HDF file was created in the same directory as this notebook.

In [8]:
Path('./myFirstDatastore.h5').exists()

True

# Pulling data from the database
Now that data has been placed inside our database, how do we get it out?

We can use the `HDFDatastore.get()` method to get the data for a specific dataset. The `get()` method accepts a DatasetID (`HDFDatastore.dsID`) that specifies the dataset's IDs and returns an object allowing access to the data.

The order of IDs is important; it is:

1. prefix
2. acqID
3. datasetType
4. attributeOf
5. channelID
6. dateID
7. posID
8. sliceID

In [9]:
# Define the dataset IDs
dsID = myDS.dsID('HeLaL_Control', 1, 'Localizations', None, None, None, None, None)

# Extract the dataset from the database
myData = myDS.get(dsID)

Finally, we can use myData's `data` field to access the data retrieved from the database. Here, we compute some summary statistics and display the first few rows of the localization data.

In [10]:
# describe() is a Pandas DataFrame method that displays
# summary statistics
myData.data.describe()

,x,y,z,frame,uncertainty,intensity,offset,loglikelihood,sigma
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,8994.581818,59467.181818,0.0,50.0,5.993009,10992.2,720.831818,1847.315455,179.28
std,1170.696295,1687.184034,0.0,0.0,3.013617,8734.24533,367.812667,3631.486533,39.753501
min,6770.0,56713.0,0.0,50.0,1.0787,3107.8,270.24,243.08,111.56
25%,8024.15,58228.5,0.0,50.0,4.3144,7599.9,508.74,554.72,158.095
50%,9163.2,59647.0,0.0,50.0,6.5072,8408.1,641.58,643.07,198.22
75%,9866.6,60286.0,0.0,50.0,7.18055,11132.6,922.995,1064.22,201.995
max,10350.0,62858.0,0.0,50.0,10.883,35038.0,1346.0,12727.0,218.79


In [11]:
# head() is a Pandas DataFrame method that displays
# the first five rows
myData.data.head()

,x,y,z,frame,uncertainty,intensity,offset,loglikelihood,sigma
0,6770.0,59386,0,50,9.5138,4386.6,270.24,425.92,218.79
1,7958.1,59762,0,50,6.7329,8310.3,562.65,619.47,199.5
2,7840.8,60819,0,50,2.1987,15671.0,1261.1,1691.4,119.47
3,8090.2,59801,0,50,7.6282,6952.3,642.53,506.19,206.46
4,9010.3,59647,0,50,6.5814,8408.1,684.29,821.24,197.9


# Set-like operations on HDFDatastores

HDFDatastores support many standard Python operations for sets.

In [12]:
# Number of datasets
len(myDS)

6

In [13]:
# Iteration
for ds in myDS:
    print(ds)

DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='WidefieldImage', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)
DatasetID(prefix='HeLaS_Control', acqID=2, datasetType='WidefieldImage', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)
DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='Localizations', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)
DatasetID(prefix='HeLaS_Control', acqID=2, datasetType='Localizations', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)
DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='LocMetadata', attributeOf='Localizations', channelID=None, dateID=None, posID=None, sliceID=None)
DatasetID(prefix='HeLaS_Control', acqID=2, datasetType='LocMetadata', attributeOf='Localizations', channelID=None, dateID=None, posID=None, sliceID=None)


In [14]:
# Filtering and list comprehensions
filteredSets = [ds for ds in myDS if ds.prefix == 'HeLaL_Control' and ds.datasetType == 'WidefieldImage']

print(filteredSets)

[DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='WidefieldImage', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)]


In [15]:
# Integer-based indexing
myDS[2]

DatasetID(prefix='HeLaL_Control', acqID=1, datasetType='Localizations', attributeOf=None, channelID=None, dateID=None, posID=None, sliceID=None)
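
These operations work because Python lets any class opt in to `len()`, iteration, and indexing by defining a few special methods. The toy class below sketches that mechanism. `TinyStore` is hypothetical and is not how `HDFDatastore` is implemented, and the membership test at the end is part of the toy only.

```python
class TinyStore:
    """Toy container that mimics the operations shown above."""

    def __init__(self, ids):
        self._ids = list(ids)

    def __len__(self):               # enables len(store)
        return len(self._ids)

    def __iter__(self):              # enables `for x in store`
        return iter(self._ids)

    def __getitem__(self, index):    # enables store[0]
        return self._ids[index]

    def __contains__(self, item):    # enables `x in store`
        return item in self._ids


store = TinyStore([('HeLaL_Control', 1), ('HeLaS_Control', 2)])
print(len(store))                       # 2
print([prefix for prefix, _ in store])  # ['HeLaL_Control', 'HeLaS_Control']
print(store[1])                         # ('HeLaS_Control', 2)
print(('HeLaL_Control', 1) in store)    # True
```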

# Summary

1. A B-Store datastore is an organized collection of raw data and metadata from an SMLM experiment.
2. B-Store provides a built-in datastore known as `HDFDatastore` that stores the data in an HDF file.
3. A datastore requires a `Parser` to convert your files into the format that the datastore knows how to handle.
4. B-Store organizes datasets into acquisition groups that are defined by a **prefix** and **acquisition ID**. A single dataset within an acquisition is identified by a **datasetType** and possibly an **attributeOf** field, a **channel ID**, a **dateID**, a **position ID**, and a **slice ID**.
5. You can perform a dry run before building to verify what files will go into the database using `build(dryRun = True)`.
6. After building the datastore, data may be retrieved using the `get()` method.
7. HDFDatastores support many standard Python operations for sets.

In [16]:
# Delete the database example file
import os
os.remove('myFirstDatastore.h5')