# Building a simple database from a small experiment
In this first tutorial we'll take a small experiment that includes raw localizations, widefield images, and metadata and build them into a database. The purpose of doing so is to provide a compact, well-organized representation of single molecule localization microscopy (SMLM) data that facilitates high-content analysis and reproducibility.

The database will exist within an [HDF](https://www.hdfgroup.org/) file. The organization of the data inside the file will be handled by B-Store.

In [1]:
# Import the essential bstore libraries
from bstore import database, parsers

# pathlib is part of the standard library in Python 3.4 and later; it is not part of B-Store
from pathlib import Path

## Before starting: Get the test data
You can get the test data for this tutorial from the B-Store test repository at https://github.com/kmdouglass/bstore_test_files. Clone or download the files and change the path below to point to the folder *test_experiment_2* within this repository.

In [2]:
dataDirectory = Path('../../bstore_test_files/test_experiment_2/') # ../ means go up one directory level

*test_experiment_2* contains a few folders with data from an imaging experiment performed on HeLa cells using STORM. The raw STORM localizations are in the files matching the pattern \*locResults_processed.csv. Metadata for the localizations are in JSON format and stored in files matching the \*locMetadata.json pattern. Before each STORM image, widefield images were captured and saved in [OME.TIFF](https://www.openmicroscopy.org/site/support/ome-model/ome-tiff/) format. The naming pattern for these files is \*WFN_MMStack_Pos0.ome, where N is an integer that relates the widefield image to the corresponding localization dataset.
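
To make the three naming patterns concrete, here is a minimal sketch that classifies file names with the standard library only. The file names are invented for illustration, the trailing `*` after `.ome` is an assumption about the file extension, and `classify` and `widefield_number` are hypothetical helpers, not B-Store functions.

```python
import re
from fnmatch import fnmatch

# Hypothetical file names, invented only to illustrate the three patterns
files = [
    'HeLaS_Control_IFFISH_A647_1_locResults_processed.csv',
    'HeLaS_Control_IFFISH_A647_1_locMetadata.json',
    'HeLaS_Control_IFFISH_A647_WF1_MMStack_Pos0.ome.tif',
    'notes.txt',
]

# Shell-style patterns modeled on the ones described in the text
patterns = {
    'locResults':     '*locResults_processed.csv',
    'locMetadata':    '*locMetadata.json',
    'widefieldImage': '*WF*_MMStack_Pos0.ome*',
}

def classify(name):
    """Return the first dataset type whose pattern matches, else None."""
    for dataset_type, pattern in patterns.items():
        if fnmatch(name, pattern):
            return dataset_type
    return None

def widefield_number(name):
    """Extract the integer N from a name containing WFN_MMStack_Pos0.ome."""
    match = re.search(r'WF(\d+)_MMStack_Pos0\.ome', name)
    return int(match.group(1)) if match else None

for name in files:
    print(name, '->', classify(name), widefield_number(name))
```

Files that match none of the patterns, such as `notes.txt` here, are simply reported as `None`.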

# Step one: Create a parser to read the datasets
In this step, we'll create a parser that can read the files that are stored inside the test data directory and convert them into a format that's more suitable for automated organization and retrieval. The default parser that comes with B-Store is called `MMParser` and is short for Micro-Manager parser. This is the parser that we use to read datasets that were generated by Micro-Manager and our own localization computing software.

Because every lab acquires and computes localizations differently, you will likely need to modify or write your own `Parser` (more on this in a later tutorial).

In [3]:
# Create the parser
parser = parsers.MMParser()

And that's it! Of course, this step is easy if a parser already exists for your data.

We're also ignoring some optional arguments to the `MMParser()` constructor, but we'll get to those in a later example.

# Step two: Create the empty database object
The `HDFDatabase` object is what B-Store uses to build a database inside an HDF file. It is a specific type of a more general object known simply as a B-Store `Database`. All `Database` objects know how to handle data in only one special type of format. The `Parser`'s job is to convert your raw acquisition data into this format.

This type of design feature, where data must be structured in a certain way as it passes into and out of a database, is known as an interface. The advantage of the interface is that you can structure your data however you want on either side of the interface so long as it can be translated into the right format. For now, we won't worry about what this format is, but we will come back to it in a later tutorial.
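
The idea of an interface can be sketched in a few lines of plain Python. Everything below is hypothetical and is not B-Store's actual format or classes: `DatasetRecord` stands in for the common format, the two parser functions handle two invented file naming schemes, and `ToyDatabase` accepts only the common format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical common format that sits at the interface."""
    prefix: str
    acqID: int
    datasetType: str
    channelID: Optional[str] = None

def parse_underscore_name(filename):
    """Parse invented names like 'Cells_Control_A647_1_locResults.csv'."""
    stem = filename.rsplit('.', 1)[0]
    *prefix_parts, channel, acq, dataset_type = stem.split('_')
    return DatasetRecord('_'.join(prefix_parts), int(acq), dataset_type, channel)

def parse_dash_name(filename):
    """Parse invented names like 'locResults-1-A647-Cells_Control.csv'."""
    stem = filename.rsplit('.', 1)[0]
    dataset_type, acq, channel, prefix = stem.split('-', 3)
    return DatasetRecord(prefix, int(acq), dataset_type, channel)

class ToyDatabase:
    """Stand-in for a database: it accepts DatasetRecord objects only."""
    def __init__(self):
        self.records = []

    def put(self, record):
        if not isinstance(record, DatasetRecord):
            raise TypeError('expected a DatasetRecord')
        self.records.append(record)

db = ToyDatabase()
db.put(parse_underscore_name('Cells_Control_A647_1_locResults.csv'))
db.put(parse_dash_name('locResults-1-A647-Cells_Control.csv'))

# Both naming schemes end up as the same record on the database side
print(db.records[0] == db.records[1])
```

The database code never needs to know which naming scheme a lab uses; only the parser changes.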

When we create the object, we specify a path to the file where the information will be stored. Note that no file is created until data is actually put into the database.

In [4]:
# The path is relative to this notebook.
# Alternatively, you could send a Path object
# instead of a string to the HDFDatabase constructor.
dbName = 'myFirstDatabase.h5'
myDB   = database.HDFDatabase(dbName)

# Step three: Run a test build of the database
Now comes the fun part. We build the database by using the HDFDatabase's `build()` method. To do this, we need to send a few required arguments to the method. These are:

1. `parser` - The parser we created to interpret the data files
2. `searchDirectory` - The parent directory containing subdirectories with all the experimental data

There are also a few optional arguments whose defaults we will override to match our file naming patterns. These optional arguments are

1. `locResultsString` - A string at the end of all raw localization file names, including the file type
2. `locMetadataString` - Same as above, but for metadata associated with the localization files
3. `widefieldImageString` - A string at the end of the file names of any widefield images in the directory

Finally, there is a boolean argument named `dryRun`. If you set this to True, the build method won't actually create the database. It will however return a structure that tells you what datasets were successfully parsed and capable of insertion into the database. By default, `dryRun` is set to False.
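
The general dry-run pattern can be illustrated with a small, self-contained sketch. This is not how B-Store's `build()` is implemented; the function, its arguments, and the file names are hypothetical and only show the idea of parsing everything while writing nothing.

```python
def build(filenames, store, dry_run=False):
    """Parse every file name; write to `store` only when dry_run is False.

    Hypothetical function for illustration, not B-Store's build() method.
    """
    parsed = []
    for name in filenames:
        if name.endswith('locResults_processed.csv'):
            parsed.append((name, 'locResults'))
        elif name.endswith('locMetadata.json'):
            parsed.append((name, 'locMetadata'))
        # Names matching neither suffix are skipped

    if not dry_run:
        store.extend(parsed)

    print('{} files were successfully parsed.'.format(len(parsed)))
    return parsed

store = []
names = ['a_locResults_processed.csv', 'a_locMetadata.json', 'readme.txt']

# The dry run reports what would be stored but leaves the store untouched
report = build(names, store, dry_run=True)
store_size_after_dry_run = len(store)

# The real run performs the same parsing and then writes the results
build(names, store, dry_run=False)
```

Because both runs share the same parsing code, the dry-run report is a faithful preview of what the real run will store.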

Let's go ahead and set `build()`'s arguments and do a dry run of the build.

In [5]:
# Note that the default values for locMetadataString and widefieldImageString
# will work in this example
myDB.build(parser, dataDirectory,
           locResultsString = 'locResults_processed.csv',
           dryRun = True)

16 files were successfully parsed.


prefix,acqID,channelID,datasetType,posID,sliceID
HeLaS_Control_IFFISH,1,A647,locResults,"(0,)",
HeLaS_Control_IFFISH,1,A647,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,1,A750,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,1,A647,locMetadata,"(0,)",
HeLaS_Control_IFFISH,2,A647,locResults,"(0,)",
HeLaS_Control_IFFISH,2,A647,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,2,A750,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,2,A647,locMetadata,"(0,)",
HeLaS_shTRF2_IFFISH,1,A647,locResults,"(0,)",
HeLaS_shTRF2_IFFISH,1,A647,widefieldImage,"(0,)",


The above table contains all the datasets that the `HDFDatabase` found in the `searchDirectory` and is sorted by the acquisition's prefix and ID number. Let's go through these results to understand what they are telling us.

## prefix and acqID
The `prefix` is the descriptive name given to a dataset. In this example, it contains the cell type (HeLa S), the conditions (either Control or shTRF2), and the labeling strategy (IF-FISH). The prefix can be anything you want and is required for insertion into the database. The table is telling us that two different conditions were imaged, and for these conditions there were two acquisitions.

The `acqID` number is an integer that identifies an acquisition and is also required. An acquisition is simply a collection of datasets containing localizations, metadata, and possibly widefield images of a single field of view. The set of all acquisitions with the same `prefix` forms an acquisition group.
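
As a rough illustration of how `prefix` and `acqID` nest, the sketch below groups a few rows copied from the dry-run table using only the standard library. The nested dictionary is just a convenient picture of the hierarchy, not B-Store's internal layout.

```python
from collections import defaultdict

# (prefix, acqID, channelID, datasetType) rows copied from the dry-run table
rows = [
    ('HeLaS_Control_IFFISH', 1, 'A647', 'locResults'),
    ('HeLaS_Control_IFFISH', 1, 'A647', 'widefieldImage'),
    ('HeLaS_Control_IFFISH', 1, 'A750', 'widefieldImage'),
    ('HeLaS_Control_IFFISH', 1, 'A647', 'locMetadata'),
    ('HeLaS_Control_IFFISH', 2, 'A647', 'locResults'),
    ('HeLaS_shTRF2_IFFISH', 1, 'A647', 'locResults'),
]

# groups[prefix][acqID] -> list of (channelID, datasetType) pairs
groups = defaultdict(lambda: defaultdict(list))
for prefix, acq_id, channel, dataset_type in rows:
    groups[prefix][acq_id].append((channel, dataset_type))

for prefix, acquisitions in groups.items():
    print(prefix, 'has acquisitions', sorted(acquisitions))
```

Each outer key is an acquisition group, each inner key is one acquisition, and each list holds the datasets that belong to that acquisition.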

## datasetType
The `datasetType` is also a required property. The `datasetType` tells the database what type of data it is looking at during the build operation so that it knows how to store it.

Currently, `datasetType` supports three options:

1. locResults - Tabulated localization data
2. locMetadata - Textual metadata describing the localizations
3. widefieldImage - A single widefield image of the field of view

## channelID, posID, and sliceID
These fields are optional and specify the fluorescence channel, position, and axial slice of a field of view, respectively. They serve to more precisely identify datasets in complex acquisitions.

`A647` and `A750` denote AlexaFluor 647 and AlexaFluor 750, respectively, which were the fluorophores being imaged in these datasets. In the above example, all the localizations were taken in the `A647` channel, but two widefield images were taken for each acquisition: one in the `A647` channel and one in the `A750` channel.

`(0,)` is a one-element tuple whose single integer identifies the position corresponding to this dataset. This allows the user to specify different positions on a sample that were imaged within the same acquisition. It can also take the form of a two-element tuple like `(x,y)` if desired.
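
In Python, the trailing comma is what makes `(0,)` a tuple rather than a parenthesized integer. The short example below shows the difference; the variable names are illustrative only.

```python
single = (0,)    # one-element tuple: the trailing comma makes it a tuple
not_tuple = (0)  # just the integer 0 wrapped in parentheses
pair = (2, 3)    # two-element tuple, e.g. an (x, y) position

print(type(single).__name__, len(single))   # tuple 1
print(type(not_tuple).__name__)             # int
print(type(pair).__name__, len(pair))       # tuple 2
```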

Finally, you can see that no axial slice is specified in these datasets.

# Perform the real database build
Now that we've verified everything going into the database, we can build it by setting `dryRun` to False.

In [6]:
myDB.build(parser, dataDirectory,
           locResultsString = 'locResults_processed.csv',
           dryRun = False)

16 files were successfully parsed.


prefix,acqID,channelID,datasetType,posID,sliceID
HeLaS_Control_IFFISH,1,A647,locResults,"(0,)",
HeLaS_Control_IFFISH,1,A647,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,1,A750,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,1,A647,locMetadata,"(0,)",
HeLaS_Control_IFFISH,2,A647,locResults,"(0,)",
HeLaS_Control_IFFISH,2,A647,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,2,A750,widefieldImage,"(0,)",
HeLaS_Control_IFFISH,2,A647,locMetadata,"(0,)",
HeLaS_shTRF2_IFFISH,1,A647,locResults,"(0,)",
HeLaS_shTRF2_IFFISH,1,A647,widefieldImage,"(0,)",


Let's verify that the HDF file was created in the same directory as this notebook.

In [7]:
Path('./myFirstDatabase.h5').exists()

True

# Pulling data from the database
Now that data has been placed inside our database, how do we get it out?

We can use the `HDFDatabase.get()` method to get the data for a specific dataset. The `get()` method accepts a dictionary that specifies the dataset's IDs and returns an object that gives access to the data.
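
Before passing such a dictionary along, it can help to check it against the required and optional fields described earlier. The helper below is a hypothetical sketch and not part of B-Store; it assumes that `prefix`, `acqID`, and `datasetType` are required and that `channelID`, `posID`, and `sliceID` default to `None`.

```python
REQUIRED = ('prefix', 'acqID', 'datasetType')
OPTIONAL = ('channelID', 'posID', 'sliceID')
DATASET_TYPES = ('locResults', 'locMetadata', 'widefieldImage')

def check_dataset_id(ds_id):
    """Return ds_id with missing optional fields filled in as None.

    Hypothetical helper for illustration, not part of B-Store.
    """
    missing = [key for key in REQUIRED if ds_id.get(key) is None]
    if missing:
        raise ValueError('missing required fields: {}'.format(missing))

    unknown = set(ds_id) - set(REQUIRED) - set(OPTIONAL)
    if unknown:
        raise ValueError('unknown fields: {}'.format(sorted(unknown)))

    if ds_id['datasetType'] not in DATASET_TYPES:
        raise ValueError(
            'unsupported datasetType: {}'.format(ds_id['datasetType']))

    # Start with every optional field set to None, then apply the given values
    complete = {key: None for key in OPTIONAL}
    complete.update(ds_id)
    return complete

checked = check_dataset_id({'prefix': 'HeLaS_Control_IFFISH',
                            'acqID': 1,
                            'datasetType': 'locResults',
                            'channelID': 'A647',
                            'posID': (0,)})
print(checked['sliceID'])
```

A check like this catches misspelled keys early, before they reach the database.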

In [8]:
# Define the dataset ID's
dsID = {'acqID'       : 1,
        'channelID'   : 'A647',
        'datasetType' : 'locResults',
        'posID'       : (0,),
        'prefix'      : 'HeLaS_Control_IFFISH',
        'sliceID'     : None}

# Extract the dataset from the database
myData = myDB.get(dsID)

Finally, the data itself is available through myData's `data` field. Here, we compute some summary statistics and display the first few rows of the localization data.

In [9]:
# describe() is a Pandas DataFrame method that displays
# summary statistics
myData.data.describe()

,x,y,z,frame,precision,photons,background,loglikelihood,sigma
count,764176.0,764176.0,764176,764176.0,764176.0,764176.0,764176.0,764176.0,764176.0
mean,52831.876706,51255.740393,0,3776.210047,4940354000.0,3692.273819,175.29648,319.384749,150.740367
std,25779.302131,27747.772305,0,4075.476684,4318717000000.0,2851.963762,173.242586,1775.875229,33.140342
min,45.232,0.36656,0,50.0,0.55033,1.0,43.116,-46.002,54.0
25%,31991.0,25948.0,0,290.0,4.8352,1792.0,74.854,102.01,127.5
50%,55862.0,55390.0,0,1992.0,6.1971,2559.1,103.41,153.02,139.61
75%,70287.0,72941.0,0,6600.0,7.2866,4798.025,207.48,329.15,164.99
max,100110.0,100110.0,0,13660.0,3775300000000000.0,94281.0,3358.2,390250.0,378.0


In [10]:
# head() is a Pandas DataFrame method that displays
# the first five rows
myData.data.head()

,x,y,z,frame,precision,photons,background,loglikelihood,sigma
0,626.16,41046,0,50,19.562,920.15,918.75,606.4,74.177
1,416.36,45301,0,50,9.649,7248.6,395.12,1073.7,239.48
2,278.86,39048,0,50,4.4805,7715.4,483.98,334.81,166.65
3,457.47,37180,0,50,5.4137,4483.7,295.36,196.39,155.9
4,336.16,40588,0,50,8.4281,8422.9,625.27,670.59,213.58


# Summary

1. A B-Store database is an organized collection of raw data and metadata from an SMLM experiment.
2. B-Store provides a built-in database known as `HDFDatabase` that stores the data in an HDF file.
3. A database requires a `Parser` to convert your files into the format that the database knows how to handle.
4. B-Store organizes datasets into acquisition groups that share a **prefix**; each acquisition within a group is identified by an **acquisition ID**. A single dataset within an acquisition is identified by a **dataset type** and, optionally, a **channel ID**, a **position ID**, and a **slice ID**.
5. You can perform a dry run before building to verify which files will go into the database using `build(dryRun = True)`.
6. After building the database, data may be retrieved using the `get()` method.

In [11]:
# Delete the database example file
import os
os.remove('myFirstDatabase.h5')