# Writing custom parsers
B-Store was designed to work with your data by not enforcing strict rules about file formats. This means, for example, that you are not required to follow a certain column naming convention or to use .csv files when generating your raw data.

While this gives you a lot of flexibility when acquiring your data in the lab, it does come at a cost: you must write your own parser to translate your files into a format that can be organized by B-Store.

B-Store comes with a built-in parser known as a `SimpleParser` to provide out-of-the-box functionality for simple datasets. In this tutorial, we'll write the SimpleParser from scratch to demonstrate how you may write your own parsers for B-Store.

## The logic of B-Store
B-Store was designed to take localization data, widefield images, and metadata and convert them into a format that is easily stored for both human and machine interpretation. This logic is illustrated below:

<img src="../design/dataset_logic.png" width = 50%/>

The role of the `Parser` is to take these raw datasets and assign to them a descriptive name (known as a `prefix`) that identifies datasets that should be grouped together, such as grouping data from controls and treatments into separate groups. Within these groups, which are known as acquisition groups, each dataset is identified by a number known as the `acqID` and the type of data it contains, the `datasetType`. Finally, there are a number of other fields that may identify the dataset if more precise delimitation between datasets is required.

When provided with a file, a `Parser` is required to specify the following fields:

- `acqID` - a unique integer for a given prefix
- `prefix` - a string that gives a descriptive name to the dataset
- `datasetType` - one of the strings listed in the `__Types_Of_Atoms__` variable in *config.py*; at the time of writing, these are 'locResults', 'locMetadata', or 'widefieldImage'

Additionally, the `Parser` must provide a way to access the actual data contained in a file. Depending on the `datasetType`, the data from a file is represented internally as one of these data types after it is loaded into memory:

- `locResults` - [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
- `locMetadata` - [JSON](http://www.json.org/) string-value pairs
- `widefieldImage` - 2D [Numpy](http://www.numpy.org/) array

# The `Parser` interface
The reason that B-Store needs this ID information is that organization in the database can be automated only if the data matches the database interface. In B-Store, this interface is known as a `DatabaseAtom`; an actual realization of a dataset is known as a `Dataset`. In software engineering terms, the `Dataset` class *implements* the `DatabaseAtom` interface, which just means that a `Dataset` knows how to communicate with a database and vice versa.

To ease its creation, a parser must also implement an interface known as a `Parser`. The `Parser` interface is simply a list of functions that a Python class must implement to be called a `Parser`. Let's start by looking at the code for this interface:

In [1]:
# Import B-Store's parsers module
from bstore import parsers

# Used to retrieve the code
import inspect

In [2]:
print(inspect.getsource(parsers.Parser))

class Parser(metaclass = ABCMeta):
    """Translates files to machine-readable data structures with acq. info.
    
    Attributes
    ----------
    acqID       : int
        The number identifying the Multi-D acquisition for a given prefix name.
    channelID   : str
        The color channel associated with the dataset.
    dateID      : str
        The date of the acquistion in the format YYYY-mm-dd.
    posID       : (int,) or (int, int)
        The position identifier. It is a single element tuple if positions were
        manually set; otherwise, it's a 2-tuple indicating the x and y
        identifiers.
    prefix      : str
        The descriptive name given to the dataset by the user.
    sliceID     : int
        The number identifying the z-axis slice of the dataset.
    datasetType : str
        The type of data contained in the dataset. Can be one of 'locResults',
        'locMetadata', or 'widefieldImage'.
       
    """
    def __init__(self, prefix, acqID, datasetType

Examining the code above, we can see that a `Parser` has two functions:

- `__init__()` : the constructor that assigns the class fields
- `getBasicInfo()` : returns a dictionary with the Parser's information

Furthermore, there are a few functions that are preceded by `abstractproperty` or `abstractmethod` and that don't actually do anything (their bodies contain only the word `pass`). These are the functions and properties that our custom `Parser` must define to work with our data. They are:

- `data` - contains the actual data from a file
- `getDatabaseAtom()` - returns a DatabaseAtom instance that can be put inside a B-Store database
- `parseFilename()` - generates the DatabaseAtom ID fields from a file or filename
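
To see what implementing an interface means in practice, here is a minimal sketch using Python's `abc` module. The `MiniParser` and `EchoParser` classes are hypothetical stand-ins that are much smaller than B-Store's actual `Parser`, and the sketch uses `@property` stacked on `@abstractmethod` rather than `abstractproperty`. It only illustrates that a class cannot be instantiated until every abstract member is defined.

```python
from abc import ABCMeta, abstractmethod


class MiniParser(metaclass=ABCMeta):
    """Hypothetical, stripped-down stand-in for an abstract parser interface."""

    @property
    @abstractmethod
    def data(self):
        pass

    @abstractmethod
    def parseFilename(self, filename):
        pass


class EchoParser(MiniParser):
    """Defines every abstract member, so it can be instantiated."""

    def parseFilename(self, filename):
        self._filename = filename

    @property
    def data(self):
        return self._filename


try:
    MiniParser()  # fails: the abstract members are not defined
    abstract_error = None
except TypeError as err:
    abstract_error = err

parser = EchoParser()
parser.parseFilename('HeLa_2.csv')
print(parser.data)
```

Python raises a `TypeError` for `MiniParser()` because the abstract members have no implementation, whereas `EchoParser` works because it supplies all of them.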

# Designing the `SimpleParser`

## File naming conventions
For the sake of this tutorial, let's suppose that our acquisition software produces files that follow this naming convention:

- **prefix_acqID.csv** : `locResults` come in .csv files with a common name, followed by an underscore, and then an integer identifier. For example, HeLa_2.csv
- **prefix_acqID.txt** : `locMetadata` is found in .txt files with prefixes and acquisition IDs that match their corresponding localization data
- **prefix_acqID.tif** : `widefieldImage`s are found in .tif files whose names also match the corresponding localization data.

## SimpleParser inputs and outputs
Our `SimpleParser` will be relatively, well, simple: it only converts these files into a format that B-Store can organize. This will hopefully give you the main idea about how you may write your own parser and provide a base class for doing so.

The parser's constructor will take no arguments. Its main function, `parseFilename()`, will take a string as input that represents a file's name and another string representing the `datasetType` of the file. This function will set the ID fields of the `Parser` and also tell the Parser how to read the data.

Let's write an outline of this class following this design that doesn't actually do anything.

```python
class SimpleParser(Parser):
    """A simple parser for extracting acquisition information.
    
    The SimpleParser converts files of the format prefix_acqID.* into
    DatabaseAtoms for insertion into a database. * may represent .csv files
    (for locResults), .txt (for locMetadata), and .tif (for widefieldImages).
    
    """
    def __init__(self):
        pass
    
    def getDatabaseAtom(self):
        pass
    
    def parseFilename(self):
        pass
    
    @property
    def data(self):
        pass 
```

With the skeleton above we have all the functions and the `data` property that are required by the interface, plus a constructor named `__init__()`. The problem is, there's no actual functionality at the moment.

### `parseFilename()`
Most of the work done by the Parser happens in the `parseFilename()` function. This function reads a filename and then fills in the appropriate fields of the `Parser` parent class, like `acqID`, `prefix`, etc. The function should also take an argument that we'll call `datasetType` that tells it what kind of dataset it's looking at. The function then handles each type of dataset differently.

Let's add this argument and another named `filename`, then begin to flesh out the function.

```python
def parseFilename(self, filename, datasetType = 'locResults'):
    """Converts a filename into a DatabaseAtom.
        
    Parameters
    ----------
    filename : str or Path
        A string or pathlib Path object containing the dataset's filename.
    datasetType : str
        The type of the dataset being parsed. This tells the Parser
        how to interpret the data.
            
    """
    pass # Don't do anything yet
```

First, we'll save the filename, which contains the full path to the file, to a private variable for later use.

```python
# Save the full path to the file for later.
# If filename is already a Path object, this does nothing.
self._fullPath = pathlib.Path(filename) 
```

Next, we need to account for the fact that the input filename can be either a string or a pathlib `Path` object. To do this, we convert a `Path` to a string using the `str()` function. This is done because we'll use string manipulations later to parse the filename.

```python
# Convert Path objects to strings if Path is supplied
if isinstance(filename, pathlib.PurePath):
    filename = str(filename.name)
```

We check against `PurePath`, the parent class of `Path`, so that any kind of pathlib path object is recognized regardless of the user's operating system. The `.name` property of a path is simply the file's name without the parent folders.
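
As a small illustration of this check, the sketch below builds a few path objects and confirms that each one is an instance of `PurePath` and yields the same `.name`. The example paths are made up for the demonstration.

```python
import pathlib

# Pure paths can be created on any operating system
win_path = pathlib.PureWindowsPath(r'C:\data\HeLa_Control_7.csv')
posix_path = pathlib.PurePosixPath('path/to/HeLa_Control_7.csv')

# A concrete Path matches whatever system the code runs on
concrete = pathlib.Path('path/to/HeLa_Control_7.csv')

for p in (win_path, posix_path, concrete):
    assert isinstance(p, pathlib.PurePath)

# .name drops the parent folders in every case
names = [p.name for p in (win_path, posix_path, concrete)]
print(names)

# A plain string is not a PurePath, so it skips the conversion step
print(isinstance('path/to/HeLa_Control_7.csv', pathlib.PurePath))
```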

Now let's look again briefly at the naming convention of our data. All of our files follow the rule **prefix_acqID.xxx**. This means that the file type--.csv, .txt, or .tif--already tells us the dataset type. The first part of the filename will always tell us the `prefix`, which can be anything, and the last part will always be an underscore followed by an integer `acqID`.

We can easily extract this information with Python's built-in string manipulation tools and the *os.path* module.

In [3]:
from os.path import splitext

# Example
filename = 'path/to/HeLa_Control_7.csv'

# Remove the '.csv'
print('Remove the file type: ' + splitext(filename)[0])

# Remove any parent folders
print('Remove the file type and parent folders: ' + splitext(filename)[0].split('/')[-1])

# This works if there are no parent folders, too
print(splitext('HeLa_Control_7.csv')[0].split('/')[-1])

Remove the file type: path/to/HeLa_Control_7
Remove the file type and parent folders: HeLa_Control_7
HeLa_Control_7


The `prefix` and `acqID` values are easy to get. We simply split the string at the last underscore and take the part before it as the `prefix` and the part after as the `acqID`. Python's `rsplit()` function does this for us. Finally, we convert the `acqID` from a string to an integer.

In [4]:
# Isolate the root filename
rootName = splitext(filename)[0].split('/')[-1]

# Split the string at the last underscore
prefix, acqID = rootName.rsplit('_', 1)
acqID = int(acqID) # Convert the string to an integer

print('prefix is: {:s}'.format(prefix))
print('acqID is: {:d}'.format(acqID))

prefix is: HeLa_Control
acqID is: 7
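
It can help to see what happens when a filename does not follow the **prefix_acqID** convention. The sketch below wraps the same steps in a hypothetical helper named `split_root_name()` (not part of B-Store) and uses `PurePath.stem` instead of `splitext()` and `split('/')` as one possible alternative. Both failure cases raise a `ValueError`.

```python
from pathlib import PurePath


def split_root_name(filename):
    """Return (prefix, acqID) for names of the form prefix_acqID.ext."""
    # .stem drops the parent folders and the final file extension
    root_name = PurePath(filename).stem
    prefix, acq_id = root_name.rsplit('_', 1)  # ValueError if no underscore
    return prefix, int(acq_id)                 # ValueError if not an integer


print(split_root_name('path/to/HeLa_Control_7.csv'))

for bad_name in ('HeLa.csv', 'HeLa_Control_final.csv'):
    try:
        split_root_name(bad_name)
    except ValueError as err:
        print('{}: {}'.format(bad_name, err))
```

A parser built on these steps therefore needs some way to report files that do not match the convention.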


The `datasetType` was already an input to the `parseFilename()` function, so we don't need to do anything to get it from the filename. We will, however, add one additional part to the code to check whether the input `datasetType` is actually one that is recognized by B-Store. We do this by checking whether the string is inside a list of valid types called `typesOfAtoms` that is accessed through B-Store's `database` module.

```python
if datasetType not in database.typesOfAtoms:
    raise DatasetError(datasetType)
```
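
The snippet above relies on names defined inside B-Store. As a self-contained sketch of the same check, the code below uses hypothetical stand-ins: `TYPES_OF_ATOMS` copies the three types listed earlier in this tutorial, and `DatasetError` is a locally defined exception rather than B-Store's own class.

```python
# Hypothetical stand-ins; B-Store defines its own list and exception class.
TYPES_OF_ATOMS = ('locResults', 'locMetadata', 'widefieldImage')


class DatasetError(Exception):
    """Raised when an unrecognized dataset type is supplied."""


def check_dataset_type(datasetType):
    """Return datasetType unchanged, or raise if it is not recognized."""
    if datasetType not in TYPES_OF_ATOMS:
        raise DatasetError(datasetType)
    return datasetType


check_dataset_type('locResults')       # passes silently

try:
    check_dataset_type('locresults')   # wrong capitalization
except DatasetError as err:
    print('Unrecognized dataset type:', err)
```

Note that the comparison is case-sensitive, so a misspelled or differently capitalized type is rejected before any parsing takes place.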

Now we have all of the IDs that the parser is designed to interpret: `prefix`, `acqID`, and `datasetType`. The other IDs, which are `channelID`, `dateID`, `posID`, and `sliceID`, are optional and can be implemented in your own parser. The SimpleParser will not assign values to them.

We finish the function by calling the constructor of the parent of `SimpleParser`, which is known as `Parser`. This will properly assign the values extracted from the filename. Finally, we set the class field `_initialized` to True. `_initialized` will appear later in the constructor definition, `__init__()`.

```python
super(SimpleParser, self).__init__(prefix, acqID, datasetType)
self._initialized = True
```

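The full `parseFilename()` function for `SimpleParser` looks like what follows below. After the `datasetType` check, the parsing steps are wrapped inside a try...except statement in case an error is raised during parsing. If an error is raised, the `self._initialized` field is set to False and the error is raised again. The code assumes that `import pathlib`, `import sys`, and `from os.path import splitext` appear near the top of the file.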

```python
    def parseFilename(self, filename, datasetType = 'locResults'):
        """Converts a filename into a DatabaseAtom.
        
        Parameters
        ----------
        filename      : str or Path
            A string or pathlib Path object containing the dataset's filename.
        datasetType   : str
            The type of the dataset being parsed. This tells the Parser
            how to interpret the data.
            
        """
        # Check for a valid datasetType
        if datasetType not in database.typesOfAtoms:
            raise DatasetError(datasetType)        
        
        try:
            # Save the full path to the file for later.
            # If filename is already a Path object, this does nothing.
            self._fullPath = pathlib.Path(filename)        

            # Convert Path objects to strings if Path is supplied
            if isinstance(filename, pathlib.PurePath):
                filename = str(filename.name)

            # Remove file type ending and any parent folders
            # Example: 'path/to/HeLa_Control_7.csv' becomes 'HeLa_Control_7'
            rootName = splitext(filename)[0].split('/')[-1]

            # Extract the prefix and acqID
            prefix, acqID = rootName.rsplit('_', 1)
            acqID = int(acqID)

            # Initialize the Parser
            super(SimpleParser, self).__init__(prefix, acqID, datasetType)
            self._initialized = True
        except:
            self._initialized = False
            print('Error: File could not be parsed.', sys.exc_info()[0])
            raise
```

#### Parsing optional IDs

If you do want to set properties like `channelID`, you can add them as optional arguments to the call to the constructor. This would look like:

```python
super(SimpleParser, self).__init__(prefix, acqID, datasetType, channelID = extractedID)
```

where `extractedID` contains whatever channel identifier you extracted.
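
How you extract such an identifier depends entirely on your own naming convention. As one hedged illustration, suppose (hypothetically) that your files were named **prefix_channelID_acqID.csv**, for example *HeLa_Control_A647_7.csv*. The helper `split_with_channel()` below is not part of B-Store; it only sketches how the extra field could be pulled out of the filename.

```python
from pathlib import PurePath


def split_with_channel(filename):
    """Return (prefix, channelID, acqID) for prefix_channelID_acqID.ext."""
    root_name = PurePath(filename).stem
    # Split at the last two underscores only, so the prefix may itself
    # contain underscores.
    prefix, channel_id, acq_id = root_name.rsplit('_', 2)
    return prefix, channel_id, int(acq_id)


prefix, extractedID, acqID = split_with_channel('HeLa_Control_A647_7.csv')
print(prefix, extractedID, acqID)
```

The resulting `extractedID` could then be passed as the `channelID` argument shown above.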

## The `data` property
The next most important addition to our `SimpleParser` skeleton is the `data` property. This will tell the `SimpleParser` how to read the data that is in a file and format them for insertion into the database. Again, we can rely on a lot of built-in and 3rd party libraries in Python for most of this part.

Even though the `data` field is accessed like a class property, it's actually defined as a function. This is achieved by inserting the [`@property` decorator](https://docs.python.org/3.5/library/functions.html#property) before the function definition.

```python
@property
def data(self):
    pass
```
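
To illustrate how a property behaves, here is a small sketch with a hypothetical `LazyReader` class. The counter is only there to show that the function body runs every time the property is accessed, which is also how the parser's `data` property re-reads its file on each access.

```python
class LazyReader:
    """Hypothetical class whose `data` attribute is computed on each access."""

    def __init__(self):
        self.loadCount = 0

    @property
    def data(self):
        # Stand-in for reading a file from disk
        self.loadCount += 1
        return [1, 2, 3]


reader = LazyReader()
first = reader.data    # no parentheses: accessed like a field
second = reader.data   # the function body runs again
print(reader.loadCount)
```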

There are currently three possible values for the `datasetType`, so this function has to define how each of these types is read. Let's set up a series of `if...elif` statements to handle each type of dataset.

```python
def data(self):
    if self.datasetType == 'locResults':
        pass

    elif self.datasetType == 'locMetadata':
        pass

    elif self.datasetType == 'widefieldImage':
        pass
        
```

### `locResults`
Localization results are stored on disk in .csv files. In B-Store memory, however, localization results are stored in a data type known as a [Pandas DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). To read the data from the .csv file into a DataFrame, we can use the `read_csv()` function supplied by Pandas.

```python
if self.datasetType == 'locResults':
    # Loading the csv file when data() is called reduces the
    # chance that large DataFrames needlessly remain in memory.
    with open(str(self._fullPath), 'r') as file:            
        df = pd.read_csv(file)
        return df
```

This opens the file whose path is stored in `self._fullPath` and reads in the data using Pandas's `read_csv()`. (Note that we've assumed you have imported Pandas using `import pandas as pd` near the top of the file.) We don't require anything special to import the csv files, but if your file contains, for example, comments or a delimiter other than a comma, you can specify these in [`read_csv()`'s optional arguments](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). Finally, the DataFrame is returned from the function.
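
As a sketch of those optional arguments, the example below reads made-up, semicolon-delimited text that begins with a comment line. The column names and values are invented for illustration, and an in-memory `io.StringIO` object stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-delimited with a comment line
raw = io.StringIO(
    '# acquired with made-up software\n'
    'x;y;frame\n'
    '10.5;20.1;1\n'
    '11.0;19.8;2\n'
)

# sep sets the delimiter; comment marks the character that starts a comment
df = pd.read_csv(raw, sep=';', comment='#')
print(df.columns.tolist())
print(len(df))
```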

### `locMetadata`
B-Store stores metadata as [JSON strings](http://www.json.org/). We will assume that the data in our text files are already valid JSON strings and simply read them with Python's built-in `json` library. (You will need to insert the line `import json` near the top of the code file.)

```python
elif self.datasetType == 'locMetadata':
    # Read the txt file and parse its JSON contents.
    with open(str(self._fullPath), 'r') as file:
        metadata = json.load(file)
        return metadata
```

This part is much the same as for the localization results except that we use `json.load()` to read the text file instead of `pd.read_csv()`.
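
To make the behavior of `json.load()` concrete, the following sketch parses made-up metadata from an in-memory text object; the keys and values are hypothetical. It also shows what happens when the text is not valid JSON, which is the assumption this part of the parser depends on.

```python
import io
import json

# Hypothetical metadata; a real file would be opened with open()
text_file = io.StringIO('{"exposure_ms": 10, "camera": "sCMOS"}')

metadata = json.load(text_file)
print(type(metadata).__name__)
print(metadata['exposure_ms'])

# json.load raises an error if the text is not valid JSON
try:
    json.load(io.StringIO('exposure_ms = 10'))
except json.JSONDecodeError as err:
    print('Not valid JSON:', err)
```

In Python, a JSON object is returned as a dictionary, so the individual metadata fields can be looked up by key.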

### `widefieldImage`
The ability to read .tif files is provided by [matplotlib](http://matplotlib.org/), specifically the `pyplot.imread()` function. Add the line `from matplotlib.pyplot import imread` near the top of the code file to use this function.

```python
elif self.datasetType == 'widefieldImage':
    # Load the image data only when called
    return imread(str(self._fullPath))
```

### The full function definition
```python
@property
def data(self):
    if self.datasetType == 'locResults':
        # Loading the csv file when data() is called reduces the
        # chance that large DataFrames needlessly remain in memory.
        with open(str(self._fullPath), 'r') as file:            
            df = pd.read_csv(file)
            return df

    elif self.datasetType == 'locMetadata':
        # Read the txt file and parse its JSON contents.
        with open(str(self._fullPath), 'r') as file:
            metadata = json.load(file)
            return metadata

    elif self.datasetType == 'widefieldImage':
        # Load the image data only when called
        return imread(str(self._fullPath))
```

## `getDatabaseAtom()`
The purpose of this function is simply to return an object implementing the `DatabaseAtom` interface that is built from the parsed identifiers and data. B-Store only knows how to place objects that implement this interface into a database. (Remember that B-Store provides a `Dataset` object that already implements this interface for you.)

First, we check that the parser has already been initialized and raise an error if `_initialized` is False. Next, we call the superclass function `getBasicInfo()` to return a dictionary containing all the identifiers that the parser has interpreted from the file. Then, we use these identifiers to initialize a new `Dataset` and finally return it.

```python
def getDatabaseAtom(self):
    """Returns an object capable of insertion into a SMLM database.

    Returns 
    -------
    dba : DatabaseAtom
        One atomic unit for insertion into the database.

    """
    if not self._initialized:
        raise ParserNotInitializedError('Parser not initialized.')
    
    ids = self.getBasicInfo()
    dba = database.Dataset(ids['prefix'], ids['acqID'], ids['datasetType'],
                           self.data, channelID = ids['channelID'],
                           dateID = ids['dateID'], posID = ids['posID'], 
                           sliceID = ids['sliceID'])
    return dba
```

The above code requires the import statement `from bstore import database` at the top of the code file to access the `Dataset` class. The `Dataset` constructor requires the `prefix`, `acqID`, and `datasetType` IDs and the `self.data` field, in that order. The remaining IDs are optional.

## `__init__()`
The final function to define is `SimpleParser`'s constructor, which in Python is called `__init__()`. We will only need one line that sets the `_initialized` property to False.

```python
def __init__(self):
    self._initialized = False
```
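
The reason for this flag is easier to see in isolation. The sketch below uses a hypothetical `GuardedParser` class and a locally defined `ParserNotInitializedError` as stand-ins for the B-Store classes; it only demonstrates the guard pattern, not the actual parsing.

```python
class ParserNotInitializedError(Exception):
    """Hypothetical stand-in for B-Store's exception of the same name."""


class GuardedParser:
    """Hypothetical parser that refuses to produce output before parsing."""

    def __init__(self):
        self._initialized = False

    def parseFilename(self, filename):
        self._filename = filename
        self._initialized = True

    def getDatabaseAtom(self):
        if not self._initialized:
            raise ParserNotInitializedError('Parser not initialized.')
        return self._filename


gp = GuardedParser()
try:
    gp.getDatabaseAtom()           # called too early
except ParserNotInitializedError as err:
    print(err)

gp.parseFilename('HeLa_2.csv')
print(gp.getDatabaseAtom())
```

Because the constructor sets the flag to False, calling `getDatabaseAtom()` on a fresh parser fails with a clear message instead of an unrelated attribute error.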

# The `SimpleParser` class definition
Here is the final, full definition for our `SimpleParser` class.

```python
class SimpleParser(Parser):
    """A simple parser for extracting acquisition information.
    
    The SimpleParser converts files of the format prefix_acqID.* into
    DatabaseAtoms for insertion into a database. * may represent .csv files
    (for locResults), .txt (for locMetadata), and .tif (for widefieldImages).
    
    """
    def __init__(self):
        self._initialized = False
    
    def getDatabaseAtom(self):
        """Returns an object capable of insertion into a SMLM database.
        
        Returns 
        -------
        dba : DatabaseAtom
            One atomic unit for insertion into the database.
        
        """
        if not self._initialized:
            raise ParserNotInitializedError('Parser not initialized.')
        
        ids = self.getBasicInfo()
        dba = database.Dataset(ids['prefix'], ids['acqID'], ids['datasetType'],
                               self.data, channelID = ids['channelID'],
                               dateID = ids['dateID'], posID = ids['posID'], 
                               sliceID = ids['sliceID'])
        return dba
    
    def parseFilename(self, filename, datasetType = 'locResults'):
        """Converts a filename into a DatabaseAtom.
        
        Parameters
        ----------
        filename      : str or Path
            A string or pathlib Path object containing the dataset's filename.
        datasetType   : str
            The type of the dataset being parsed. This tells the Parser
            how to interpret the data.
            
        """
        # Check for a valid datasetType
        if datasetType not in database.typesOfAtoms:
            raise DatasetError(datasetType)        
        try:
            # Save the full path to the file for later.
            # If filename is already a Path object, this does nothing.
            self._fullPath = pathlib.Path(filename)        

            # Convert Path objects to strings if Path is supplied
            if isinstance(filename, pathlib.PurePath):
                filename = str(filename.name)

            # Remove file type ending and any parent folders
            # Example: 'path/to/HeLa_Control_7.csv' becomes 'HeLa_Control_7'
            rootName = splitext(filename)[0].split('/')[-1]

            # Extract the prefix and acqID
            prefix, acqID = rootName.rsplit('_', 1)
            acqID = int(acqID)

            # Initialize the Parser
            super(SimpleParser, self).__init__(prefix, acqID, datasetType)
            self._initialized = True
        except:
            self._initialized = False
            print('Error: File could not be parsed.', sys.exc_info()[0])
            raise
    
    @property
    def data(self):
        if self.datasetType == 'locResults':
            # Loading the csv file when data() is called reduces the
            # chance that large DataFrames needlessly remain in memory.
            with open(str(self._fullPath), 'r') as file:            
                df = pd.read_csv(file)
                return df
                
        elif self.datasetType == 'locMetadata':
            # Read the txt file and parse its JSON contents.
            with open(str(self._fullPath), 'r') as file:
                metadata = json.load(file)
                return metadata
            
        elif self.datasetType == 'widefieldImage':
            # Load the image data only when called
            return imread(str(self._fullPath))
```

# Example
For this example, you can use the test data in the [bstore_test_files](https://github.com/kmdouglass/bstore_test_files) repository. Download the files from GitHub using the link and change the path below to point to *parsers_test_files/SimpleParser* on your machine.

In [5]:
from pathlib import Path

# Specify the test dataset
pathToFiles = Path('../../bstore_test_files/parsers_test_files/SimpleParser/')

In [6]:
# Create the SimpleParser
sp = parsers.SimpleParser()

# Specify a file to parse
file = pathToFiles / Path('HeLaL_Control_1.csv')

# Parse this file
sp.parseFilename(file, datasetType = 'locResults')

# Summarize the localization data
sp.data.describe()

| | x | y | z | frame | uncertainty | intensity | offset | loglikelihood | sigma |
|---|---|---|---|---|---|---|---|---|---|
| count | 11.0 | 11.0 | 11 | 11 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| mean | 8994.581818 | 59467.181818 | 0 | 50 | 5.993009 | 10992.2 | 720.831818 | 1847.315455 | 179.28 |
| std | 1170.696295 | 1687.184034 | 0 | 0 | 3.013617 | 8734.24533 | 367.812667 | 3631.486533 | 39.753501 |
| min | 6770.0 | 56713.0 | 0 | 50 | 1.0787 | 3107.8 | 270.24 | 243.08 | 111.56 |
| 25% | 8024.15 | 58228.5 | 0 | 50 | 4.3144 | 7599.9 | 508.74 | 554.72 | 158.095 |
| 50% | 9163.2 | 59647.0 | 0 | 50 | 6.5072 | 8408.1 | 641.58 | 643.07 | 198.22 |
| 75% | 9866.6 | 60286.0 | 0 | 50 | 7.18055 | 11132.6 | 922.995 | 1064.22 | 201.995 |
| max | 10350.0 | 62858.0 | 0 | 50 | 10.883 | 35038.0 | 1346.0 | 12727.0 | 218.79 |


In [7]:
# Return a Dataset that can be inserted into a B-Store database
ds = sp.getDatabaseAtom()
print(ds.prefix)
print(ds.acqID)
ds.data.describe()

HeLaL_Control
1


| | x | y | z | frame | uncertainty | intensity | offset | loglikelihood | sigma |
|---|---|---|---|---|---|---|---|---|---|
| count | 11.0 | 11.0 | 11 | 11 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| mean | 8994.581818 | 59467.181818 | 0 | 50 | 5.993009 | 10992.2 | 720.831818 | 1847.315455 | 179.28 |
| std | 1170.696295 | 1687.184034 | 0 | 0 | 3.013617 | 8734.24533 | 367.812667 | 3631.486533 | 39.753501 |
| min | 6770.0 | 56713.0 | 0 | 50 | 1.0787 | 3107.8 | 270.24 | 243.08 | 111.56 |
| 25% | 8024.15 | 58228.5 | 0 | 50 | 4.3144 | 7599.9 | 508.74 | 554.72 | 158.095 |
| 50% | 9163.2 | 59647.0 | 0 | 50 | 6.5072 | 8408.1 | 641.58 | 643.07 | 198.22 |
| 75% | 9866.6 | 60286.0 | 0 | 50 | 7.18055 | 11132.6 | 922.995 | 1064.22 | 201.995 |
| max | 10350.0 | 62858.0 | 0 | 50 | 10.883 | 35038.0 | 1346.0 | 12727.0 | 218.79 |
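
The example above parses a single file. If all of your files follow the naming convention from this tutorial, the dataset type can be inferred from the file extension, which makes it possible to loop over a whole folder. The sketch below is one possible approach: `SUFFIX_TO_TYPE` and `list_datasets()` are hypothetical names that are not part of B-Store, and empty placeholder files in a temporary folder stand in for real data.

```python
import tempfile
from pathlib import Path

# Mapping taken from the naming convention assumed in this tutorial
SUFFIX_TO_TYPE = {
    '.csv': 'locResults',
    '.txt': 'locMetadata',
    '.tif': 'widefieldImage',
}


def list_datasets(folder):
    """Return sorted (filename, datasetType) pairs for recognized files."""
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        datasetType = SUFFIX_TO_TYPE.get(path.suffix.lower())
        if datasetType is not None:
            pairs.append((path.name, datasetType))
    return pairs


# Demonstrate with empty placeholder files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ('HeLa_1.csv', 'HeLa_1.txt', 'HeLa_1.tif', 'notes.md'):
        (Path(tmp) / name).touch()
    found = list_datasets(tmp)

print(found)
```

Each filename and dataset type found this way could then be handed to a parser's `parseFilename()` function in turn; files with unrecognized extensions are skipped.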


# Summary

- A `Parser` reads raw data files and converts them into a format for insertion into a B-Store database.
- A `SimpleParser` is already built into B-Store.
- The `SimpleParser` knows how to read files of the format **prefix**\_**acqID**.filetype.
- When writing a `Parser`, you need to define at least three members of the `Parser` interface: the `parseFilename()` and `getDatabaseAtom()` functions and the `data` property.
- `parseFilename()` knows how to extract B-Store identifiers from files.
- `getDatabaseAtom()` returns a `Dataset` object, which implements the `DatabaseAtom` interface.
- `data` uses the `@property` decorator and tells the `SimpleParser` how to read the data in the files.