# Writing custom parsers
B-Store was designed to work with your data by not enforcing strict rules about file formats. This means, for example, that you are not required to follow a certain column naming convention or to use .csv files when generating your raw data.

While this gives you a lot of flexibility when acquiring your data in the lab, it does come at a cost: you might need to write your own parser or dataset type.

B-Store comes with two built-in parsers, the `SimpleParser` and the `PositionParser`, to provide out-of-the-box functionality for simple datasets. In this tutorial, we'll write the `SimpleParser` from scratch to demonstrate how you may write your own parsers for B-Store.

## The logic of B-Store
B-Store was designed to take localization data, widefield images, metadata, and more and convert them into a format that is easily stored for both human and machine interpretation. This logic is illustrated below:

<img src="../images/dataset_logic.png" width = 50%/>

The role of the `Parser` is to take these raw datasets and assign to them a descriptive name (known as a `prefix`) that identifies datasets that should be grouped together, such as grouping data from controls and treatments into separate groups. Within these groups, which are known as acquisition groups, each dataset is identified by a number known as the `acqID` and the type of data it contains, the `datasetType`. Finally, there are a number of other fields that may identify the dataset if more precise delimitation between datasets is required.

When provided with a file, a `Parser` is required to specify the following fields:

- `acqID` - a unique integer for a given prefix
- `prefix` - a string that gives a descriptive name to the dataset
- `datasetType` - this is actually specified by the user, but used by the `Parser` to know how to read the data from a file

# The `Parser` interface
The reason that B-Store needs this ID information is that organization in the datastore can be automated only if the data matches the datastore interface. In B-Store, this interface is known as a `DatasetID`.

To ease its creation, a parser must also implement an interface known as a `Parser`. The `Parser` interface is simply a list of functions that a Python class must implement to be called a `Parser`. Let's start by looking at the code for this interface:

In [1]:
# Import B-Store's parsers module
from bstore import parsers

# Used to retrieve the code
import inspect

In [2]:
print(inspect.getsource(parsers.Parser))

class Parser(metaclass = ABCMeta):
    """Translates SMLM files to machine-readable data structures.
    
    Attributes
    ----------
    dataset  : Dataset
        A Dataset object for insertion into a B-Store Datastore.
    requiresConfig : bool
        Does parser require configuration before use? This is primarily
        used by the GUI to determine whether the parser has attributes that
        are set by its __init__() method or must be set before parsing files.
       
    """
    def __init__(self):
        # Holds a parsed dataset.
        self._dataset = None
    
    @property
    def dataset(self):
        if self._dataset:
            return self._dataset
        else:
            raise ParserNotInitializedError('Error: There is currently no'
                                            'parsed dataset to return.')
        
    @dataset.setter
    def dataset(self, ds):
        self._dataset = ds
    
    @abstractproperty
    def requiresConfig(self):
        pass    
 

Examining the code above, we can see that a `Parser` has one function:

- `__init__()` : the constructor that assigns the class fields

There is also an attribute known as a `dataset` that contains the dataset object after a filename has been parsed.
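
To see how the `dataset` property behaves before anything has been parsed, here is a minimal sketch that copies only the getter and setter logic shown above. `MiniParser` and the locally defined `ParserNotInitializedError` are stand-ins written for this illustration; they are not B-Store's own classes.

```python
class ParserNotInitializedError(Exception):
    """Stand-in for B-Store's exception of the same name."""


class MiniParser:
    """Hypothetical class that copies only the dataset property logic."""

    def __init__(self):
        # Holds a parsed dataset.
        self._dataset = None

    @property
    def dataset(self):
        if self._dataset:
            return self._dataset
        raise ParserNotInitializedError(
            'There is currently no parsed dataset to return.')

    @dataset.setter
    def dataset(self, ds):
        self._dataset = ds


mp = MiniParser()

# Reading the property before parsing raises an error
try:
    mp.dataset
    raisedBeforeParsing = False
except ParserNotInitializedError:
    raisedBeforeParsing = True

# Any truthy object can stand in for a parsed dataset here
mp.dataset = {'prefix': 'HeLa', 'acqID': 2}
storedDataset = mp.dataset
```

Note that the getter tests the stored object for truthiness, so in this sketch an empty container would be treated the same as `None`.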

Python uses the decorators `abstractproperty` or `abstractmethod` to identify attributes and functions that a real Parser instance must provide (in this abstract base class their bodies contain only the word `pass`).

- `requiresConfig` - Used by the GUI to indicate whether a popup window will appear for configuring the parser
- `parseFilename` - generates the DatabaseAtom ID fields from a file or filename
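
The effect of these decorators can be sketched with Python's `abc` module alone. The class names below are hypothetical and the example assumes both members are abstract, as the list above describes. It stacks `@property` on `@abstractmethod`, which is the modern equivalent of `abstractproperty`.

```python
from abc import ABCMeta, abstractmethod


class AbstractMiniParser(metaclass=ABCMeta):
    """Hypothetical interface with one abstract property and one method."""

    @property
    @abstractmethod
    def requiresConfig(self):
        pass

    @abstractmethod
    def parseFilename(self, filename, datasetType):
        pass


class IncompleteParser(AbstractMiniParser):
    # parseFilename() is missing, so this class is still abstract
    @property
    def requiresConfig(self):
        return False


class CompleteParser(IncompleteParser):
    def parseFilename(self, filename, datasetType):
        return filename, datasetType


# Python refuses to instantiate a class with unimplemented abstract members
try:
    IncompleteParser()
    incompleteCanBeCreated = True
except TypeError:
    incompleteCanBeCreated = False

complete = CompleteParser()
```

Only `CompleteParser` can be instantiated, because it is the only class that provides every member the interface asks for.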

# Designing the `SimpleParser`

## File naming conventions
For the sake of this tutorial, let's suppose that our acquisition software produces files that follow this naming convention:

- **prefix_acqID.csv** : `locResults` come in .csv files that share a common name, followed by an underscore and then an integer identifier. For example, HeLa_2.csv
- **prefix_acqID.txt** : `locMetadata` is found in .txt files with prefixes and acquisition IDs that match their corresponding localization data
- **prefix_acqID.tif** : `widefieldImage`s are found in .tif files that also match the corresponding localization data.
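
As a rough illustration, the convention listed above can be written down as a regular expression. `EXTENSION_TO_TYPE`, `NAMING_PATTERN`, and `describeFile()` are hypothetical names for this sketch; the `SimpleParser` itself uses plain string splitting rather than a regex.

```python
import re

# Hypothetical lookup table based on the convention listed above
EXTENSION_TO_TYPE = {
    'csv': 'locResults',
    'txt': 'locMetadata',
    'tif': 'widefieldImage',
}

# prefix (anything), underscore, integer acqID, dot, file extension
NAMING_PATTERN = re.compile(
    r'^(?P<prefix>.+)_(?P<acqID>\d+)\.(?P<ext>csv|txt|tif)$')


def describeFile(name):
    """Return (prefix, acqID, datasetType), or None if name does not match."""
    match = NAMING_PATTERN.match(name)
    if match is None:
        return None
    return (match.group('prefix'),
            int(match.group('acqID')),
            EXTENSION_TO_TYPE[match.group('ext')])


simple = describeFile('HeLa_2.csv')
withUnderscore = describeFile('HeLa_Control_7.tif')
notMatching = describeFile('notes.txt')
```

Because the prefix group is greedy, a prefix that itself contains underscores is kept whole and only the final integer is treated as the `acqID`.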

## SimpleParser inputs and outputs
Our `SimpleParser` will take a relatively, well, simple approach to converting these files into a format that B-Store can organize. This will hopefully give you the main idea of how you may write your own parser and provide a base class for doing so.

The parser's constructor will take no arguments. Its main function, `parseFilename()`, will take a string as input that represents a file's name and another string representing the `datasetType` of the file. This function will set the ID fields of the `Parser` and also tell the Parser how to read the data.

Let's write an outline of this class following this design that doesn't actually do anything.

```python
class SimpleParser(Parser):
    """A simple parser for extracting acquisition information.
    
    The SimpleParser converts files of the format prefix_acqID.* into
    DatabaseAtoms for insertion into a database. * may represent .csv files
    (for locResults), .json (for locMetadata), and .tif (for widefieldImages).
    
    """
    @property
    def requiresConfig(self):
        return False
    
    def parseFilename(self):
        pass
    
```

With the skeleton above we have all the elements that are required by the interface. The problem is, there's no actual functionality at the moment.

### `parseFilename()`
Most of the work done by the Parser happens in the `parseFilename()` function. This function reads a filename and then fills in the appropriate fields of the `Parser` parent class, like `acqID`, `prefix`, etc. The function should also take an argument that we'll call `datasetType` that tells it what kind of dataset it's looking at. The function then handles each type of dataset differently.

Let's add this argument and another named `filename`, then begin to flesh out the function.

```python
def parseFilename(self, filename, datasetType = 'Localizations', **kwargs):
        """Converts a filename into a Dataset.
        
        Parameters
        ----------
        filename      : str or Path
            A string or pathlib Path object containing the dataset's filename.
        datasetType   : str
            The type of the dataset being parsed. This tells the Parser
            how to interpret the data.
            
        """
```

First, we'll reset the parser by setting the `dataset` (provided by the `Parser` interface) to None.

```python
# Resets the parser
self.dataset = None
```

Next, we need to check that the dataset type provided as an argument is currently registered. B-Store requires type registration because it helps prevent reading and writing unwanted file types when traversing directories of raw data. Without it, some unwanted files could accidentally sneak into the Datastore if their naming pattern matched that of a dataset type. Note that `config` refers to `bstore.config` and must be imported in the file for this code.

```python
# Check for a valid datasetType
if datasetType not in config.__Registered_DatasetTypes__:
    raise DatasetTypeError(('{} is not a registered '
                            'type.').format(datasetType))
```
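
The registration check can be tried out in isolation. In this sketch, `registeredTypes` is a hypothetical stand-in for the list kept in `bstore.config`, `DatasetTypeError` is redefined locally, and `checkDatasetType()` is a helper invented for the example.

```python
class DatasetTypeError(Exception):
    """Stand-in for B-Store's exception of the same name."""


# Hypothetical registry; in B-Store this list lives in bstore.config
registeredTypes = ['Localizations', 'LocMetadata', 'WidefieldImage']


def checkDatasetType(datasetType, registry):
    """Raise DatasetTypeError if datasetType is not in the registry."""
    if datasetType not in registry:
        raise DatasetTypeError(
            '{} is not a registered type.'.format(datasetType))
    return datasetType


accepted = checkDatasetType('Localizations', registeredTypes)

# An unregistered type is rejected before any file is read
try:
    checkDatasetType('Spreadsheet', registeredTypes)
    errorMessage = None
except DatasetTypeError as err:
    errorMessage = str(err)
```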

The full path to the filename is saved for later and Path objects are converted to strings:

```python
# Save the full path to the file for later.
# If filename is already a Path object, this does nothing.
self._fullPath = pathlib.Path(filename)        

# Convert Path objects to strings if Path is supplied
if isinstance(filename, pathlib.PurePath):
    filename = str(filename.name)
```
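
A short experiment shows what these two steps do to each kind of input. It uses `PurePosixPath` so that the results do not depend on the operating system; the variable names are only for illustration.

```python
import pathlib

asString = 'path/to/HeLa_Control_7.csv'
asPath = pathlib.PurePosixPath(asString)

# Wrapping a Path in another Path yields an equal object
rewrapped = pathlib.PurePosixPath(asPath)

# .name keeps only the final component, dropping any parent folders
nameOnly = str(asPath.name)

# A plain string is not a PurePath, so the isinstance check skips it
stringIsPath = isinstance(asString, pathlib.PurePath)
pathIsPath = isinstance(asPath, pathlib.PurePath)
```

One consequence is that a filename supplied as a plain string still carries its parent folders after this step, so they have to be removed separately.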

Now let's look again briefly at the naming convention of our data. All of our files follow the rule **prefix_acqID.xxx**. This means that the file type--.csv, .txt, or .tif--already tells us the dataset type. The first part of the filename will always tell us the `prefix`, which can be anything, and the last part will always be an underscore followed by an integer `acqID`.

We can easily extract this information with Python's built-in string manipulation tools and the *os.path* module.

In [3]:
from os.path import splitext

# Example
filename = 'path/to/HeLa_Control_7.csv'

# Remove the '.csv'
print('Remove the file type: ' + splitext(filename)[0])

# Remove any parent folders
print('Remove the file type and parent folders: ' + splitext(filename)[0].split('/')[-1])

# This works if there are no parent folders, too
print(splitext('HeLa_Control_7.csv')[0].split('/')[-1])

Remove the file type: path/to/HeLa_Control_7
Remove the file type and parent folders: HeLa_Control_7
HeLa_Control_7
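
Splitting on `'/'` assumes that folders are separated by forward slashes. As an alternative sketch, which is not what the `SimpleParser` does, `pathlib`'s `stem` attribute removes both the parent folders and the file type in one step and understands Windows-style separators when given a `PureWindowsPath`.

```python
from os.path import splitext
from pathlib import PurePosixPath, PureWindowsPath

# stem drops the parent folders and the final file extension
posixRoot = PurePosixPath('path/to/HeLa_Control_7.csv').stem
windowsRoot = PureWindowsPath(r'path\to\HeLa_Control_7.csv').stem

# For comparison: splitting a backslash-separated path on '/'
# leaves the parent folders in place
naiveRoot = splitext(r'path\to\HeLa_Control_7.csv')[0].split('/')[-1]
```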


The `prefix` and `acqID` values are easy to get. We simply split the string at the last underscore and take the part before it as the `prefix` and the part after as the `acqID`. Python's `rsplit()` function does this for us. Finally, we convert the `acqID` from a string to an integer.

In [4]:
# Isolate the root filename
rootName = splitext(filename)[0].split('/')[-1]

# Split the string at the last underscore
prefix, acqID = rootName.rsplit('_', 1)
acqID = int(acqID) # Convert the string to an integer

print('prefix is: {:s}'.format(prefix))
print('acqID is: {:d}'.format(acqID))

prefix is: HeLa_Control
acqID is: 7
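
It is worth checking what happens when a name does not follow the convention. The helper functions below are hypothetical; they wrap the same `rsplit()` and `int()` calls and record the exception type when the split fails.

```python
def splitRootName(rootName):
    """Split 'prefix_acqID' at the last underscore; acqID must be an integer."""
    prefix, acqID = rootName.rsplit('_', 1)
    return prefix, int(acqID)


def tryToSplit(rootName):
    """Return the split result, or the name of the exception that was raised."""
    try:
        return splitRootName(rootName)
    except ValueError as err:
        return type(err).__name__


names = ['HeLa_Control_7', 'HeLa', 'HeLa_Control', 'HeLa_007']
results = {name: tryToSplit(name) for name in names}
```

A name without an underscore fails when unpacking, and a name whose last token is not a number fails in `int()`; both raise a `ValueError`. Leading zeros are dropped by the integer conversion, so `HeLa_007` yields an `acqID` of 7.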


The `datasetType` was already an input to the `parseFilename()` function, so we don't need to do anything to get it from the filename.

Now we have all of the IDs that the parser is designed to interpret: `prefix`, `acqID`, and `datasetType`. The other IDs, which are `channelID`, `dateID`, `posID`, and `sliceID`, are optional and can be implemented in your own parser. The SimpleParser will not assign values to them.
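
As a sketch of how a custom parser might fill in one of the optional IDs, suppose your files were named **prefix_channelID_acqID**, for example `HeLa_Control_A647_7`. This naming convention, the `KNOWN_CHANNELS` tuple, and `splitWithChannel()` are all assumptions made for the illustration and are not part of B-Store.

```python
# Hypothetical channel labels; adapt these to your own experiment
KNOWN_CHANNELS = ('A647', 'A750', 'DAPI')


def splitWithChannel(rootName):
    """Return a dict of IDs from a name like 'prefix_channelID_acqID'.

    channelID is None when the token before the acqID is not a known channel.
    """
    prefix, acqID = rootName.rsplit('_', 1)
    channelID = None

    # Only treat the last token of the prefix as a channel if it is known
    if '_' in prefix:
        head, tail = prefix.rsplit('_', 1)
        if tail in KNOWN_CHANNELS:
            prefix, channelID = head, tail

    return {'prefix': prefix, 'acqID': int(acqID), 'channelID': channelID}


withChannel = splitWithChannel('HeLa_Control_A647_7')
withoutChannel = splitWithChannel('HeLa_Control_7')
```

Checking the token against a list of known channels keeps ordinary prefixes such as `HeLa_Control` from being split by mistake.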

We finish the function by building the return dataset and reading the data from the file.

```python
# Build the return dataset
idDict = {'prefix' : prefix, 'acqID' : acqID}

mod   = importlib.import_module(
    'bstore.datasetTypes.{0:s}'.format(datasetType))
dType             = getattr(mod, datasetType)
self.dataset      = dType(datasetIDs = idDict)
self.dataset.data = self.dataset.readFromFile(self._fullPath)
```

DatasetTypes are each stored in a file of the same name. The line containing `importlib` imports this file much like you would using the line `import bstore.datasetTypes.TYPE_NAME`. `dType` is the class that represents the datasetType, retrieved from the module by name. The last two lines create an instance of this class and read the data from the file.
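
The same import-by-name pattern can be tried with a standard-library module in place of `bstore.datasetTypes`. Here `loadClass()` is a hypothetical helper and `collections.OrderedDict` is only a stand-in for a dataset type class.

```python
import importlib


def loadClass(moduleName, className):
    """Import moduleName and return the attribute called className."""
    mod = importlib.import_module(moduleName)
    return getattr(mod, className)


# Standard-library stand-in for 'bstore.datasetTypes.<datasetType>'
dType = loadClass('collections', 'OrderedDict')

# dType is the class itself, so calling it creates an instance
instance = dType(prefix='HeLa', acqID=2)
```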

The full `parseFilename` function for `SimpleParser` is shown below. The body of the function is wrapped inside a try...except statement in case an error is raised during parsing. If an error is raised, the `self.dataset` field is set to None and a `ParseFilenameFailure` is raised.

```python
    def parseFilename(self, filename, datasetType = 'Localizations', **kwargs):
        """Converts a filename into a Dataset.
        
        Parameters
        ----------
        filename      : str or Path
            A string or pathlib Path object containing the dataset's filename.
        datasetType   : str
            The type of the dataset being parsed. This tells the Parser
            how to interpret the data.
            
        """
        # Resets the parser
        self.dataset = None        
        
        # Check for a valid datasetType
        if datasetType not in config.__Registered_DatasetTypes__:
            raise DatasetTypeError(('{} is not a registered '
                                    'type.').format(datasetType))     
        
        try:
            # Save the full path to the file for later.
            # If filename is already a Path object, this does nothing.
            self._fullPath = pathlib.Path(filename)        
            
            # Convert Path objects to strings if Path is supplied
            if isinstance(filename, pathlib.PurePath):
                filename = str(filename.name)
    
            # Remove file type ending and any parent folders
            # Example: 'path/to/HeLa_Control_7.csv' becomes 'HeLa_Control_7'
            rootName = splitext(filename)[0].split('/')[-1]
            
            # Extract the prefix and acqID
            prefix, acqID = rootName.rsplit('_', 1)
            acqID = int(acqID)
            
            # Build the return dataset
            idDict = {'prefix' : prefix, 'acqID' : acqID}
        
            mod   = importlib.import_module(
                'bstore.datasetTypes.{0:s}'.format(datasetType))
            dType             = getattr(mod, datasetType)
            self.dataset      = dType(datasetIDs = idDict)
            self.dataset.data = self.dataset.readFromFile(self._fullPath)
        except:
            self.dataset = None
            raise ParseFilenameFailure(('Error: File could not be parsed.',
                                        sys.exc_info()[0]))
```
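
Raising a dedicated exception lets calling code skip files that cannot be parsed while continuing with the rest. The sketch below uses a locally defined `ParseFilenameFailure` and a hypothetical `parseRootName()` function. Unlike the code above, it catches only `ValueError` and chains the original exception with `from`, which is a variation for illustration rather than B-Store's implementation.

```python
class ParseFilenameFailure(Exception):
    """Stand-in for B-Store's exception of the same name."""


def parseRootName(rootName):
    """Return (prefix, acqID) or raise ParseFilenameFailure."""
    try:
        prefix, acqID = rootName.rsplit('_', 1)
        return prefix, int(acqID)
    except ValueError as err:
        # 'from err' keeps the original exception attached as __cause__
        raise ParseFilenameFailure(
            'File could not be parsed: {}'.format(rootName)) from err


parsed, skipped = [], []
for name in ['HeLa_Control_7', 'README', 'HeLa_2']:
    try:
        parsed.append(parseRootName(name))
    except ParseFilenameFailure as failure:
        # Record the file and the type of the underlying error
        skipped.append((name, type(failure.__cause__).__name__))
```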

# The `SimpleParser` class definition
Here is the final, full definition for our `SimpleParser` class.

```python
class SimpleParser(Parser):
    """A simple parser for extracting acquisition information.
    
    The SimpleParser converts files of the format prefix_acqID.* into
    Datasets for insertion into a datastore. * represents filename
    extensions like .csv, .json, and .tif.
    
    Attributes
    ----------
    requiresConfig : bool
        Does parser require configuration before use?
    
    """
    @property
    def requiresConfig(self):
        return False
        
    def parseFilename(self, filename, datasetType = 'Localizations', **kwargs):
        """Converts a filename into a Dataset.
        
        Parameters
        ----------
        filename      : str or Path
            A string or pathlib Path object containing the dataset's filename.
        datasetType   : str
            The type of the dataset being parsed. This tells the Parser
            how to interpret the data.
            
        """
        # Resets the parser
        self.dataset = None        
        
        # Check for a valid datasetType
        if datasetType not in config.__Registered_DatasetTypes__:
            raise DatasetTypeError(('{} is not a registered '
                                    'type.').format(datasetType))     
        
        try:
            # Save the full path to the file for later.
            # If filename is already a Path object, this does nothing.
            self._fullPath = pathlib.Path(filename)        
            
            # Convert Path objects to strings if Path is supplied
            if isinstance(filename, pathlib.PurePath):
                filename = str(filename.name)
    
            # Remove file type ending and any parent folders
            # Example: 'path/to/HeLa_Control_7.csv' becomes 'HeLa_Control_7'
            rootName = splitext(filename)[0].split('/')[-1]
            
            # Extract the prefix and acqID
            prefix, acqID = rootName.rsplit('_', 1)
            acqID = int(acqID)
            
            # Build the return dataset
            idDict = {'prefix' : prefix, 'acqID' : acqID}
        
            mod   = importlib.import_module(
                'bstore.datasetTypes.{0:s}'.format(datasetType))
            dType             = getattr(mod, datasetType)
            self.dataset      = dType(datasetIDs = idDict)
            self.dataset.data = self.dataset.readFromFile(self._fullPath)
        except:
            self.dataset = None
            raise ParseFilenameFailure(('Error: File could not be parsed.',
                                        sys.exc_info()[0]))
```

# Example
For this example, you can use the test data in the [bstore_test_files](https://github.com/kmdouglass/bstore_test_files) repository. Download the files from GitHub using the link and change the path below to point to *parsers_test_files/SimpleParser* on your machine.

In [5]:
from pathlib import Path
import bstore.config

# Register the dataset types
bstore.config.__Registered_DatasetTypes__ = ['Localizations', 'LocMetadata', 'WidefieldImage']

# Specify the test dataset
pathToFiles = Path('../../bstore_test_files/parsers_test_files/SimpleParser/')

In [6]:
# Create the SimpleParser
sp = parsers.SimpleParser()

# Specify a file to parse
file = pathToFiles / Path('HeLaL_Control_1.csv')

# Parse this file
sp.parseFilename(file, datasetType = 'Localizations')

# Summarize the localization data
print(sp.dataset)

Localizations: {'prefix': 'HeLaL_Control', 'acqID': 1}


In [7]:
# Retrieve the data that was read from the file and stored inside the Dataset
ds = sp.dataset.data
ds.describe()

| | x | y | z | frame | uncertainty | intensity | offset | loglikelihood | sigma |
|---|---|---|---|---|---|---|---|---|---|
| count | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| mean | 8994.581818 | 59467.181818 | 0.0 | 50.0 | 5.993009 | 10992.2 | 720.831818 | 1847.315455 | 179.28 |
| std | 1170.696295 | 1687.184034 | 0.0 | 0.0 | 3.013617 | 8734.24533 | 367.812667 | 3631.486533 | 39.753501 |
| min | 6770.0 | 56713.0 | 0.0 | 50.0 | 1.0787 | 3107.8 | 270.24 | 243.08 | 111.56 |
| 25% | 8024.15 | 58228.5 | 0.0 | 50.0 | 4.3144 | 7599.9 | 508.74 | 554.72 | 158.095 |
| 50% | 9163.2 | 59647.0 | 0.0 | 50.0 | 6.5072 | 8408.1 | 641.58 | 643.07 | 198.22 |
| 75% | 9866.6 | 60286.0 | 0.0 | 50.0 | 7.18055 | 11132.6 | 922.995 | 1064.22 | 201.995 |
| max | 10350.0 | 62858.0 | 0.0 | 50.0 | 10.883 | 35038.0 | 1346.0 | 12727.0 | 218.79 |


# Summary

- A `Parser` reads raw data files and converts them into a format for insertion into a B-Store datastore.
- A `SimpleParser` is already built into B-Store.
- The `SimpleParser` knows how to read files of the format **prefix**\_**acqID**.filetype
- When writing a `Parser`, you need to specify at least two attributes from the `Parser` interface: `parseFilename()` and `requiresConfig`.
- `parseFilename()` knows how to extract B-Store identifiers from files.
- The dataset that is parsed is stored in `parser.dataset`. The data from the file is inside `parser.dataset.data`.