# Writing custom readers

Readers were introduced in version 1.1.0 to better enable B-Store to read data from various filetypes. A Reader is used to read the data contained in a particular file into a Python datatype, such as a Pandas DataFrame or a NumPy array. This intermediate Python datatype is a temporary holding spot for the data before it is placed into an HDF file. The abstract base class `Reader` defines a common interface for users to write their own routines for reading any type of file into Python.

To use a `Reader` when creating a `HDFDatastore`, one passes one or more instances of objects that subclass `Reader` to `HDFDatastore.build()` or `Parser.parseFilename()`. This will be described below.

In this tutorial, we'll begin by looking at the abstract base class called `Reader`. After studying the code, we'll look at a specific implementation of a `Reader` known as `CSVReader`, an object that is used to read generic CSV files and that is highly customizable.

## Special note

Version 1.1.0 introduced the Reader interface and two readers: `CSVReader` and `JSONReader`. In this version they only work for the Localizations, FiducialTracks, and AverageFiducial datasetTypes. In addition, they cannot be specified in the GUI, but may be specified using the new `readers` parameter of `HDFDatastore.build()`. All of these limitations should be gone in future versions of B-Store. Readers for non-tabulated data, such as images, should follow as well.

# The `Reader` interface

Let's begin by looking at the code for a `Reader`.

In [1]:
# Import B-Store's readers module
from bstore import readers

# Used to retrieve the code
import inspect

In [2]:
print(inspect.getsource(readers.Reader))

class Reader(metaclass=ABCMeta):
    """Reads the data for a given DatasetType from file.

    """
    @abstractmethod
    def __call__(self, filename, **kwargs):
        """Reads the data inside a file into a Python object.

        Note that a return type function annotation must be specified in the
        concrete methods to automatically match a Reader with a DatasetType.

        Parameters
        ----------
        filename : str or buffer object
            The file containing the data to read.
        **kwargs : dict
            key-value arguments to pass to the auxillary functions used by the
            file reading functions.

        """
        pass

    @abstractmethod
    def __repr__(self):
        pass

    @abstractproperty
    def __signature__(self):
        """The custom Signature object for the class's __call__ method.

        """
        pass

    @abstractmethod
    def __str__(self):
        """User-friendly and short description of the Reader.

        Thi

Looking at the code above, we can see that a Reader must have three methods and one property. The methods are:

1. `__call__` : This makes the object callable, i.e. an instance may be used like a function.
2. `__repr__` : This is a special method that returns a string that may be used to instantiate a Reader instance.
3. `__str__` : This is a more user-friendly method that returns a string describing what the Reader does.

The property that must be defined is `__signature__`. The reason for this property is that we must define the call signature for the Reader object when it is called like a function. [The call signature](https://docs.python.org/3/library/inspect.html#inspect.Signature) represents the arguments and their default values that are passed to the Reader when it is called like a function inside the `readFromFile` method of a DatasetType. Specifying a call signature will enable us to modify the arguments through a GUI window with all the argument names and values. Without a signature, we cannot easily "look inside" the Reader to figure out what arguments it takes.
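
As a rough illustration of this interface, the sketch below defines a stand-in abstract base class with the same four members and a toy concrete subclass. `ReaderSketch` and `LineReader` are hypothetical names invented for this example and are not part of B-Store; a real reader would subclass `bstore.readers.Reader` instead.

```python
import inspect
import io
from abc import ABCMeta, abstractmethod


class ReaderSketch(metaclass=ABCMeta):
    """Stand-in for the Reader interface (hypothetical, not B-Store code)."""

    @abstractmethod
    def __call__(self, filename, **kwargs):
        pass

    @abstractmethod
    def __repr__(self):
        pass

    @property
    @abstractmethod
    def __signature__(self):
        pass

    @abstractmethod
    def __str__(self):
        pass


class LineReader(ReaderSketch):
    """Toy reader that returns the lines of a text buffer as a list."""

    def __call__(self, filename, **kwargs) -> list:
        # For simplicity, this sketch only accepts an open text buffer
        return filename.read().splitlines()

    def __repr__(self):
        return 'LineReader()'

    @property
    def __signature__(self):
        return inspect.Signature([
            inspect.Parameter('filename', inspect.Parameter.POSITIONAL_ONLY)])

    def __str__(self):
        return 'Toy line reader'


reader = LineReader()
lines = reader(io.StringIO('x,y\n1,2\n'))  # call the instance like a function
```

Leaving out any one of the four members in `LineReader` would make Python refuse to instantiate it, which is how an abstract base class enforces the interface.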

# The `CSVReader` object

Let's now take a look at a concrete Reader, the `CSVReader`, which enables us to read generic .csv files in a highly customizable way.

In [3]:
reader = readers.CSVReader
print(inspect.getsource(reader))

class CSVReader(Reader):
    """Reads data from a generic comma separated values (CSV) file.

    This reader utilizes the Pandas read_csv() function, which allows many
    different parameters to be adjusted, such as the value separator. For an
    explanation of the parameters, see the reference below.

    The constructor for CSVReader creates the class's custom call signature

    References
    ----------
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

    """

    def __init__(self):
        # Create the custom call signature for this Reader
        # https://docs.python.org/3.5/library/inspect.html#inspect.Signature
        f = pd.read_csv
        sig = inspect.signature(f)
        p1 = inspect.Parameter(
            'filename', inspect.Parameter.POSITIONAL_ONLY)

        newParams = [p1] + [param for name, param in sig.parameters.items()
                            if name != 'filepath_or_buffer']

        self._sig = sig.replace(parameters=newPa

## \_\_init\_\_()

We define the CSVReader with the line:

```python

class CSVReader(Reader):

```

The `(Reader)` in parentheses tells Python that the object subclasses the `Reader` abstract base class discussed above.

Following the docstring, there is the `__init__` function, which serves as the constructor for the object. (Note that defining an `__init__` method is not required.) This Reader uses the [Pandas read_csv function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to read .csv files into Pandas DataFrames. Therefore, we extract the function signature from `read_csv` in the line

```python

sig = inspect.signature(pd.read_csv)

```

Now, we have to slightly modify the signature for our Reader because `read_csv` accepts an argument called `filepath_or_buffer` as its first argument. However, inside each datasetType's `readFromFile` method, we specify the filePath as the first positional argument. For example:

In [4]:
from bstore.datasetTypes import Localizations as locs
readFromFile = locs.Localizations.readFromFile
print(inspect.getsource(readFromFile))

    @staticmethod
    def readFromFile(filePath, **kwargs):
        """Read a file on disk containing the DatasetType.

        Parameters
        ----------
        filePath : Path
            A pathlib object pointing towards the file to open.

        Returns
        -------
        Pandas DataFrame

        """
        if ('reader' in kwargs) and (kwargs['reader']):
            reader = kwargs['reader']
            return reader(str(filePath), **kwargs)
        else:
            # Default read behavior
            return pd.read_csv(str(filePath))



In the GUI, filePath is not specified by the user but rather by B-Store's machinery for automatically detecting files. When a GUI window appears to allow someone to set the parameters of `CSVReader`, we therefore do not want them to be able to set the argument `filepath_or_buffer`.

We define a custom call signature inside `__init__()` with the lines:

```python

p1  = inspect.Parameter(
    'filename', inspect.Parameter.POSITIONAL_ONLY)
            
newParams = [p1] + [param for name, param in sig.parameters.items()
                          if name != 'filepath_or_buffer']
                                    
self._sig = sig.replace(parameters = newParams)

```

`p1` is a custom parameter that is set to be `POSITIONAL_ONLY`. Doing this ensures that we can easily separate it from the rest of the arguments of `read_csv`, which are of the kind `POSITIONAL_OR_KEYWORD`. We then prepend this new custom parameter to all the other parameters of `read_csv` **except** for `filepath_or_buffer` with the lines

```python

p1  = inspect.Parameter(
    'filename', inspect.Parameter.POSITIONAL_ONLY)
            
newParams = [p1] + [param for name, param in sig.parameters.items()
                          if name != 'filepath_or_buffer']

```

Finally, the Reader's `_sig` attribute is set to this new Signature with the line `self._sig = sig.replace(parameters = newParams)`.
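
The same steps can be tried without Pandas. In the sketch below, `load_table` is a hypothetical stand-in for `read_csv` with only two optional arguments; everything else uses the standard `inspect` module.

```python
import inspect


def load_table(filepath_or_buffer, sep=',', skiprows=0):
    """Hypothetical stand-in for a file-reading function such as read_csv."""


sig = inspect.signature(load_table)
p1 = inspect.Parameter('filename', inspect.Parameter.POSITIONAL_ONLY)

# Keep every parameter except the one that names the file
newParams = [p1] + [param for name, param in sig.parameters.items()
                    if name != 'filepath_or_buffer']
newSig = sig.replace(parameters=newParams)

paramNames = list(newSig.parameters)  # ['filename', 'sep', 'skiprows']

# The file argument is the only positional-only parameter, so it is easy
# to separate from the user-adjustable ones
positionalOnly = [name for name, param in newSig.parameters.items()
                  if param.kind is inspect.Parameter.POSITIONAL_ONLY]
```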

## \_\_signature\_\_

To return this newly-defined signature object when `inspect.signature()` is called on an instance of our class, we have the CSVReader's `__signature__` property return it:

```python

@property        
def __signature__(self):
    return self._sig

```
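
To see the effect of this property in isolation, here is a small sketch with a hypothetical class, `SignatureDemo`, that is not part of B-Store. It relies on `inspect.signature()` using an object's `__signature__` attribute when one is present.

```python
import inspect


class SignatureDemo:
    """Hypothetical callable whose reported signature is built by hand."""

    def __init__(self):
        self._sig = inspect.Signature([
            inspect.Parameter('filename', inspect.Parameter.POSITIONAL_ONLY),
            inspect.Parameter('sep', inspect.Parameter.POSITIONAL_OR_KEYWORD,
                              default=','),
        ])

    @property
    def __signature__(self):
        return self._sig

    def __call__(self, filename, **kwargs):
        return filename, kwargs


demo = SignatureDemo()

# inspect reports the hand-built signature rather than (filename, **kwargs)
reported = inspect.signature(demo)
```

Code that builds a form from `reported.parameters`, such as a GUI, would then list `sep` with its default value even though `__call__` itself only declares `**kwargs`.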

## \_\_call\_\_(self, filename, \*\*kwargs)

The \_\_call\_\_() method actually performs the act of reading the data from a file into a DataFrame. First, all keyword arguments that are not part of `read_csv` are removed from the \*\*kwargs dict. If they are not removed, `read_csv` will raise an error about an unrecognized argument. We also ensure that filename is not passed to `read_csv`, in case it was passed as a keyword.

```python

kwargs = {k: v for k, v in kwargs.items()
               if k in self.__signature__.parameters
               and k != 'filename'}

```

Next, we simply call `read_csv` with the `filename` argument and the new `kwargs` dict and return the DataFrame as a result:

```python

return pd.read_csv(filename, **kwargs)

```
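
The filtering step matters because `readFromFile` forwards every keyword argument it receives, including `reader` itself. The sketch below applies the same dict comprehension to a hand-built signature; the signature and the example values are assumptions chosen for illustration.

```python
import inspect

# Hypothetical signature standing in for the one built in __init__()
sig = inspect.Signature([
    inspect.Parameter('filename', inspect.Parameter.POSITIONAL_ONLY),
    inspect.Parameter('sep', inspect.Parameter.KEYWORD_ONLY, default=','),
    inspect.Parameter('skiprows', inspect.Parameter.KEYWORD_ONLY,
                      default=None),
])

# Keyword arguments as they might arrive from readFromFile(); 'reader' and
# 'readTiffTags' are not parameters of the reading function
kwargs = {'reader': object(), 'sep': '\t', 'skiprows': 1,
          'readTiffTags': False, 'filename': 'ignored.csv'}

# Keep only the arguments named in the signature, and never 'filename'
filtered = {k: v for k, v in kwargs.items()
            if k in sig.parameters and k != 'filename'}
```

Only `sep` and `skiprows` survive, so the reading function never sees arguments that were meant for other parts of B-Store.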

By passing \*\*kwargs into read_csv, we can assign values to *any* of `read_csv`'s arguments. `read_csv` accepts a very large number of arguments to allow you to customize its behavior. This powerful customizability is therefore translated to B-Store. One last important thing to note is the part at the end of the first line of \_\_call\_\_()'s definition:

```python

def __call__(self, filename, **kwargs) -> pd.DataFrame:

```

`-> pd.DataFrame` tells Python what datatype the function returns. This is also required by Readers and is used to automatically detect what Readers are associated with what datasetTypes. For example, Localizations are represented internally as DataFrames. `-> pd.DataFrame` tells B-Store that we can associate this reader with any datasetType that has a DataFrame as its internal representation.
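
The return annotation can be inspected at runtime. The code B-Store uses for this matching is not shown in this tutorial, so the following is only a sketch of one way such a lookup could work; `ListReader` and `returnType` are hypothetical names.

```python
import inspect


class ListReader:
    """Hypothetical reader whose __call__ is annotated to return a list."""

    def __call__(self, filename, **kwargs) -> list:
        return []


def returnType(reader):
    """Look up the return annotation of a reader's __call__ method."""
    return inspect.signature(reader.__call__).return_annotation


# Hypothetical check: does the reader produce the type a dataset expects?
matches = returnType(ListReader()) is list
```

A `__call__` method with no annotation would give `inspect.Signature.empty` here, which is why the annotation is required for automatic matching.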

## \_\_repr\_\_() and \_\_str\_\_()

These methods should be self-explanatory for Python developers. The first returns a string used by developers to represent how the instance is created and the second is a user-friendly string that can be displayed in places like the GUI.
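
For readers less familiar with these two methods, a minimal sketch follows. `TabReader` and its strings are hypothetical and only illustrate the difference between the two representations.

```python
class TabReader:
    """Hypothetical reader used only to illustrate __repr__ and __str__."""

    def __repr__(self):
        # A string that could be evaluated to recreate the instance
        return 'TabReader()'

    def __str__(self):
        # A short label suitable for display to users
        return 'Tab-delimited text reader'


reader = TabReader()
forDevelopers = repr(reader)
forUsers = str(reader)
```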

# Example
For this example, you can use the test data in the [bstore_test_files](https://github.com/kmdouglass/bstore_test_files) repository. Download the files from GitHub using the link and change the path below to point to *bstore_test_files/readers_test_files/csv/tab_delimited* on your machine.

In [5]:
from bstore import parsers
from bstore import readers
from pathlib import Path

filePath     = Path('../../bstore_test_files/readers_test_files/csv/tab_delimited/')
filename     = filePath / Path('HeLaL_Control_1.csv')

In [6]:
# Initialize the Parser and Reader                        
parser = parsers.SimpleParser()
reader = readers.CSVReader()

In [7]:
# reader keyword argument passes the CSVReader instance;
# all other keyword arguments are passed to CSVReader's __call__ function.
parser.parseFilename(filename, datasetType = 'Localizations', reader = reader, sep = '\t')

parser.dataset.data.head()

Unnamed: 0,x,y,z,frame,uncertainty,intensity,offset,loglikelihood,sigma
0,6770.0,59386,0,50,9.5138,4386.6,270.24,425.92,218.79
1,7958.1,59762,0,50,6.7329,8310.3,562.65,619.47,199.5
2,7840.8,60819,0,50,2.1987,15671.0,1261.1,1691.4,119.47
3,8090.2,59801,0,50,7.6282,6952.3,642.53,506.19,206.46
4,9010.3,59647,0,50,6.5814,8408.1,684.29,821.24,197.9


Here, we parsed the file containing localization data using the `CSVReader`. After creating the reader, we passed it as a keyword argument to parser's `parseFilename` method:

```python

parser.parseFilename(filename, datasetType = 'Localizations', reader = reader, sep = '\t')

```

The remaining keyword argument, `sep`, was passed to `read_csv` inside the reader because all keyword arguments after `datasetType` are passed to the reader object. We can pass other keyword arguments to `read_csv`, such as `skiprows`:

In [8]:
parser.parseFilename(filename, datasetType = 'Localizations', reader = reader, sep = '\t', skiprows = 1)

parser.dataset.data.head()

Unnamed: 0,6770.0,59386,0,50,9.5138,4386.6,270.24,425.92,218.79
0,7958.1,59762,0,50,6.7329,8310.3,562.65,619.47,199.5
1,7840.8,60819,0,50,2.1987,15671.0,1261.1,1691.4,119.47
2,8090.2,59801,0,50,7.6282,6952.3,642.53,506.19,206.46
3,9010.3,59647,0,50,6.5814,8408.1,684.29,821.24,197.9
4,9163.2,60771,0,50,2.5165,13696.0,1161.7,1307.2,124.29


# Passing readers to `HDFDatastore.build()`

The `HDFDatastore.build()` method, which is the main method used to create Datastores, now accepts a keyword argument known as `readers`. This argument should be a dict whose keys are the names of DatasetTypes and whose values are instances of a particular reader to use when reading data.

For example, let's say we want to build a Datastore from a small experiment and specify what readers to use when reading different dataset types. (You may need to change `testData` to point to the right folder containing the bstore test files.)

In [None]:
import bstore.config as config
from bstore import database

testData = Path('../../bstore_test_files/parsers_test_files/SimpleParser/')
dsName = 'test_datastore.h5'
config.__Registered_DatasetTypes__ = [
    'Localizations', 'LocMetadata', 'WidefieldImage']   

parser = parsers.SimpleParser()
filenameStrings = {
    'Localizations'  : '.csv',
    'LocMetadata'    : '.txt',
    'WidefieldImage' : '.tif'}
readersDict = {'Localizations': readers.CSVReader()}

# Note sep and skiprows are keyword arguments of CSVReader; readTiffTags is
# a keyword argument for the WidefieldImage readFromFile() method
with database.HDFDatastore(dsName) as myDS:
    res = myDS.build(parser, testData, filenameStrings,
                     readers=readersDict, sep=',', skiprows=2,
                     readTiffTags = False)

The above code sets up a HDFDatastore build by first specifying the location of the data, the name of the HDFDatastore, and registering the desired DatasetTypes.

```python
testData = Path('../../bstore_test_files/parsers_test_files/SimpleParser/')
dsName = 'test_datastore.h5'
config.__Registered_DatasetTypes__ = [
    'Localizations', 'LocMetadata', 'WidefieldImage']  
```

Next, a parser is created and the naming pattern for the different datasets is specified as usual:

```python
parser = parsers.SimpleParser()
filenameStrings = {
    'Localizations'  : '.csv',
    'LocMetadata'    : '.txt',
    'WidefieldImage' : '.tif'}
```

We specify that we want to use the CSVReader for reading `Localizations` Datasets from files in the `readersDict`:

```python
readersDict = {'Localizations': readers.CSVReader()}
```

Finally, we build the Datastore inside the *with...as* context manager as usual. We can pass keyword arguments to the various readers by specifying them **after the readers argument**. In this case, `sep=','` and `skiprows=2` are sent to `CSVReader`, and `readTiffTags=False` is sent to the `readFromFile` function of WidefieldImages.

```python
with database.HDFDatastore(dsName) as myDS:
    res = myDS.build(parser, testData, filenameStrings,
                     readers=readersDict, sep=',', skiprows=2,
                     readTiffTags = False)
```

Currently, readers may only be specified in this manner for Localizations, FiducialTracks, and AverageFiducial dataset types. All other specifications will be ignored.
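
The behaviour described here can be pictured as a lookup that falls back to the default reading code. The helper below is purely hypothetical, and B-Store's real implementation may differ; the string values stand in for reader instances.

```python
def pickReader(datasetType, readers, supported):
    """Return the reader for a dataset type, or None for default reading.

    Hypothetical helper; B-Store's real lookup may differ.
    """
    if datasetType not in supported:
        return None  # specifications for other types are ignored
    return readers.get(datasetType)


supported = {'Localizations', 'FiducialTracks', 'AverageFiducial'}
readersDict = {'Localizations': 'csv-reader', 'WidefieldImage': 'tiff-reader'}

locReader = pickReader('Localizations', readersDict, supported)
imgReader = pickReader('WidefieldImage', readersDict, supported)
fidReader = pickReader('FiducialTracks', readersDict, supported)
```

In this sketch, `WidefieldImage` has an entry in the dict but is not a supported type, and `FiducialTracks` is supported but has no entry, so both fall back to the default reading behavior.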

# Summary

- A Reader may be used to read the raw data from a file whose name is currently being parsed by a Parser.
- Readers are defined by an abstract base class known as `Reader`.
- To define a concrete Reader, we have to define three methods and one property. The methods are `__call__()`, `__repr__()`, and `__str__()`. The property is `__signature__`.
- Most of the work of creating a Reader goes into defining its signature. The signature is used to automatically detect what arguments the Reader requires and is used primarily by the GUI.
- Any function or code at all for reading a raw data file may be used inside `__call__()`. For `CSVReader`, we chose to use the Pandas `read_csv` function because it is highly customizable.
- To associate Readers with specific datasetTypes, we should use function annotations to specify the return type of the Reader's `__call__()` method.