# Loading data acquisition source files

Loading recordings from source data files (.tdms, .continuous, ...) is mostly abstracted by a new class called `ecogdata.devices.load.file2data.FileLoader`. Its full constructor signature is shown here. Most arguments are adapted from the earlier ad-hoc loading methods defined for each acquisition system, and each is explained in the walkthrough that follows.

```python
Init signature:
FileLoader(
    experiment_path,
    recording,
    electrode,
    bandpass=None,
    notches=None,
    units='uV',
    load_channels=None,
    trigger_idx=(),
    mapped=False,
    resample_rate=None,
    use_stored=True,
    save_downsamp=True,
    store_path=None,
    raise_on_glitch=False,
)
Docstring:      <no docstring>
Init docstring:
Data file mapping/loading. Supports downsampling.

Parameters
----------
experiment_path: str
    File system path where recordings are found
recording: str
    Name of recording in path
electrode: str
    Identifier of the channel map from `ecogdata.devices.electrode_pinouts`
bandpass: sequence
    Bandpass edges (lo, hi) -- use -1 in either place for one-sided bands
notches: sequence
    Sequence of line frequencies to notch
load_channels: sequence
    If only a subset of channels should be loaded, list them here. For example, to load channels from only one
    port, use `load_channels=range(128)`. Otherwise all channels are used.
trigger_idx: int or sequence
    The index/indices of a logic-level trigger signal in this class's trigger_array
mapped: bool or str
    If True, leave the dataset mapped to file. Otherwise load to memory. If the (mode == 'r+') then the
    mapped source will be writeable. (This will make a copy of the primary or downsampled datasource.)
resample_rate: float
    Downsample recording to this rate (must evenly divide the raw sample rate)
use_stored: bool
    If True, look for a pre-computed downsampled dataset
save_downsamp: bool
    If True, save a new downsampled dataset
store_path: str
    Save/load downsampled datasets at this path, rather than `experiment_path`
raise_on_glitch: bool
    If True, raise exceptions on unexpected events. Otherwise try to proceed with warnings.
File:           ~/work/ecogdata/ecogdata/devices/load/file2data.py
Type:           type
Subclasses:     OpenEphysLoader, ActiveLoader, RHDLoader
```
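
The `resample_rate` constraint ("must evenly divide the raw sample rate") can be checked before a loader is built. The helper below is a minimal sketch of that check; `downsample_ratio` is a hypothetical name and is not part of `ecogdata`.

```python
def downsample_ratio(raw_rate, resample_rate):
    """Return the integer decimation factor, or raise if the rates are incompatible.

    Hypothetical helper for illustration only; not part of ecogdata.
    """
    if resample_rate <= 0 or resample_rate > raw_rate:
        raise ValueError('resample_rate must be positive and no greater than the raw rate')
    ratio = raw_rate / resample_rate
    # Allow for tiny floating point error when the rates are not integers
    if abs(ratio - round(ratio)) > 1e-9:
        raise ValueError('{} does not evenly divide {}'.format(resample_rate, raw_rate))
    return int(round(ratio))


print(downsample_ratio(20000, 2000))  # 10
```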

In addition, these class level attributes define more system-specific rules:

```python
    # multiplier to scale raw data units to micro-volts
    scale_to_uv = 1.0
    # name (key) of the dataset in the HDF5 file
    data_array = 'data'
    # name of the dataset where a trigger signal may be found
    trigger_array = 'data'
    # name(s) of other auxiliary arrays to keep track of
    aligned_arrays = ()
    # transpose_array is True if the data matrix is Time x Channels
    transpose_array = False
    # allowed file extensions
    permissible_types = ['.h5', '.hdf']
```
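
The pattern here is ordinary class-attribute overriding: a subclass restates only the values that differ for its system. The toy classes below are a sketch of that idea and do not reproduce the real `FileLoader`; `to_microvolts` is a hypothetical method, and the overridden values are borrowed from the `RHDLoader` shown later in this document.

```python
class ToyLoader:
    # Stand-ins for two of the FileLoader defaults listed above
    scale_to_uv = 1.0
    data_array = 'data'

    def to_microvolts(self, raw_samples):
        # Hypothetical method: scale raw ADC counts by the class-level multiplier
        return [x * self.scale_to_uv for x in raw_samples]


class ToyRHDLoader(ToyLoader):
    # Only the attributes that differ from the defaults are restated
    scale_to_uv = 0.195
    data_array = 'amplifier_data'


microvolts = ToyRHDLoader().to_microvolts([0, 100, -200])
unscaled = ToyLoader().to_microvolts([0, 100, -200])
```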

## Case study: Intan .rhd format

Intan RHD files are recorded with 2-byte integer resolution in a block-structured format. They are typically recorded in 3-minute segments, one file per segment, so a single recording spans multiple files. Because of these hurdles, RHD files are not loaded directly; they first require a preprocessing step that packs the files into HDF5 format. That step is not covered here (see `convert_rhd.py` from rhd-to-hdf5).

Once packed, the Intan format is the simplest available case study, handled almost entirely by the abstract rules in `FileLoader`. This is the entire body of the function `ecogdata.devices.load.intan.load_rhd`:

```python
loader = RHDLoader(experiment_path, test, electrode,
                   bandpass=bandpass,
                   notches=notches,
                   units=units,
                   load_channels=load_channels,
                   trigger_idx=trigger_idx,
                   mapped=mapped,
                   resample_rate=useFs,
                   use_stored=use_stored,
                   save_downsamp=save_downsamp,
                   store_path=store_path,
                   raise_on_glitch=raise_on_glitches)
return loader.create_dataset()
```

* Step 1: create a `FileLoader` subtype (`RHDLoader`) with given arguments.
* Step 2: call the `create_dataset` method on the loader.

The subclass `RHDLoader` is itself fairly simple. It defines some attributes and modifies logic from the parent class to track the sampling rate recorded in the original header info (preserved as a JSON string in the `h5py.File.attrs` object). In this and other cases, the `create_dataset` method is the fully general one inherited from the base `FileLoader` class.

```python
class RHDLoader(FileLoader):
    scale_to_uv = 0.195
    data_array = 'amplifier_data'
    trigger_array = 'board_adc_data'
    aligned_arrays = ('board_adc_data',)
    transpose_array = False

    def raw_sample_rate(self):
        """
        Return full sampling rate (or -1 if there is no raw data file)
        """

        if os.path.exists(self.primary_data_file):
            with h5py.File(self.primary_data_file, 'r') as h5file:
                header = json.loads(h5file.attrs['JSON_header'])
                samp_rate = header['sample_rate']
        else:
            samp_rate = -1
        return samp_rate

    def create_downsample_file(self, data_file, resample_rate, downsamp_file):
        new_file = super(RHDLoader, self).create_downsample_file(data_file, resample_rate, downsamp_file)
        with h5py.File(data_file, 'r') as h5file:
            header = json.loads(h5file.attrs['JSON_header'])
        with h5py.File(new_file, 'r+') as h5file:
            print('Putting header and closing file')
            header['sample_rate'] = resample_rate
            h5file.attrs['JSON_header'] = json.dumps(header)
        return new_file
```
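
The header handling in `create_downsample_file` is a JSON round trip: decode the stored string, change one field, and encode it again for the new file. The sketch below shows only that round trip. Plain dicts stand in for the `attrs` objects of the two HDF5 files, and the header contents are made up for illustration.

```python
import json

# Plain dicts stand in for `h5py.File.attrs`; the header fields are invented
source_attrs = {'JSON_header': json.dumps({'sample_rate': 20000, 'notes': 'example'})}
downsamp_attrs = {}

# Decode the header stored with the full-rate data
header = json.loads(source_attrs['JSON_header'])
# Record the new rate, then store the re-encoded header with the downsampled data
header['sample_rate'] = 2000
downsamp_attrs['JSON_header'] = json.dumps(header)

print(json.loads(downsamp_attrs['JSON_header'])['sample_rate'])  # 2000
```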

### Constructing a `FileLoader`

Create an `RHDLoader` to walk through the steps. Use `mapped=True` to indicate that the new dataset will have a `MappedSource` for the signal array.

```python
from ecogdata.devices.load.intan import RHDLoader
from ecogdata.expconfig import load_params
import os

# This is for gabilan
# exp_path = os.path.join(load_params().network_exp, '..', 'Human_uECoG', 'Surgery_2019_07_16')
exp_path = os.path.join(load_params().local_exp, 'Human_uECoG', 'Surgery_2019_07_16')

rec_name = 'Surgery_2019-07-16_short'
electrode = 'psv_256_rhd'
loader = RHDLoader(exp_path, rec_name, electrode, mapped=True)
```

The final step in constructing a `FileLoader` is to determine data paths for source files and potential downsampling files. In this case, no downsampling is required (that comes next). The method `find_source_files` determines:

1. the current primary source
1. the destination file for downsampling (`None` here)
1. the units scale for the primary source (default output units are microvolts)

If no primary data is found, then a `DataPathError` is raised.

```python
loader.find_source_files()
```

#### Alternative 1: downsampling with an existing downsampled source

Loading with downsampling enabled engages a few more arguments:
```python
resample_rate: float
    Downsample recording to this rate (must evenly divide the raw sample rate)
use_stored: bool
    If True, look for a pre-computed downsampled dataset
store_path: str
    Save/load downsampled datasets at this path, rather than `experiment_path`
```

By default, `use_stored` is True, which allows the use of pre-computed downsample sources. `store_path` can be specified if these pre-computed sources are found somewhere other than `experiment_path`.

```python
loader = RHDLoader(exp_path, rec_name, electrode, resample_rate=2000)
loader.find_source_files()
```

A resample rate of 2000 S/s is requested. Since there happens to be a pre-computed downsample file at that rate, that file shows up as the primary source.

The downsample source must be an HDF5 file that is **"channel-compatible"** with the original file. For example, in a recording from a single 64-channel headstage and a 61-site rat electrode, three of the data channels carry no electrode data. **These three channels must also be present in the HDF5 array.** Any arrays that are "aligned" with the dataset (i.e. `board_adc_data` here) must also be present in the downsample source.

In `FileLoader`, these rules are satisfied by a combination of "mirroring" the primary data source to an empty reduced sampling rate source, and then changing the sampling from the primary source to the new source. Roughly:

```python
downsamp = datasource.mirror(new_rate_ratio=downsamp_ratio, mapped=True, channel_compatible=True,
                             filename=downsamp_file)
datasource.batch_change_rate(downsamp_ratio, downsamp)
```

*The `_FsNNNN` file name suffix is the standard convention throughout the `FileLoader` class.* Another convention is that downsample sources hold floating point arrays (single or double precision) with values in microvolt units. Hence the last value returned (the units scale) is now 1, instead of the 0.195 that applies to the original integer sources.

#### Alternative 2: downsampling without an existing downsampled source
If no downsample source exists, then another argument is engaged: whether or not to save the new downsample source. Source saving respects `store_path`, which is the same as `exp_path` by default.
```python
save_downsamp: bool
    If True, save a new downsampled dataset
```

```python
loader = RHDLoader(exp_path, rec_name, electrode, resample_rate=5000)
loader.find_source_files()
```

In this case, there is no pre-computed source for 5000 S/s. The primary source is the full sample rate, integer resolution file. The second value returned is now the name of the destination file for downsampling. The last value is the original ADC-to-microvolts scale. In this scenario, `FileLoader.create_downsample_file` (or a subclass override of it) will create an appropriate downsample source.
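
The three outcomes of `find_source_files` described in this section can be summarized in a short sketch. This is not the library's implementation. `choose_source` is a hypothetical function, `DataPathError` is redefined locally as a stand-in, the exact spelling of the `_FsNNNN` file name is an assumption, and returning `None` as the destination when a stored source is found is also an assumption.

```python
class DataPathError(Exception):
    """Local stand-in for the exception raised when no primary data is found."""


def choose_source(existing_files, recording, raw_scale, resample_rate=None, use_stored=True):
    """Return (primary source, downsample destination, units scale)."""
    raw_file = recording + '.h5'
    # Assumed spelling of the _FsNNNN convention
    downsamp_file = None
    if resample_rate is not None:
        downsamp_file = '{}_Fs{}.h5'.format(recording, int(resample_rate))
        if use_stored and downsamp_file in existing_files:
            # Pre-computed sources are floating point and already in microvolts
            return downsamp_file, None, 1.0
    if raw_file not in existing_files:
        raise DataPathError('no primary data found for ' + recording)
    # Either no downsampling is requested (destination None),
    # or the raw file must be downsampled into `downsamp_file`
    return raw_file, downsamp_file, raw_scale


files = {'rec.h5', 'rec_Fs2000.h5'}
no_downsamp = choose_source(files, 'rec', 0.195)
stored = choose_source(files, 'rec', 0.195, resample_rate=2000)
not_stored = choose_source(files, 'rec', 0.195, resample_rate=5000)
```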

### Loading steps in detail

Basic rundown:

1. Establish which data channels belong to which kind of source stream: electrode, ground, or reference channels.
1. Create a new downsample source if needed (in a file if appropriate). This becomes the new primary source.

   1. File creation happens if either `save_downsamp` is True or the final data source will be mapped.

1. Map the prevailing primary source file.

   1. `MappedSource` types are created whether or not the final source is mapped, *unless* downsampling is needed but no downsample source file is needed. In that case `downsample_and_load` is called instead.

1. At this point the data is a `MappedSource` (or a `PlainArraySource` if `downsample_and_load` was used) at the correct sampling rate.

   1. If the final source is to be loaded and is currently mapped, then loaded mirrors are created.
   1. If the data source is mapped but not writeable, then writeable mirrors are created if necessary.

1. Bandpass filtering
1. Notch filtering
1. Event "triggers" are found and processed

An `ecogdata.utils.Bunch` object is returned with various attributes including:

* `.data`: an `ElectrodeSource` (either `MappedSource` or `PlainArraySource`)
* `.Fs`: sampling rate
* `.chan_map`: an `ecogdata.channel_map.ChannelMap` object
* `.units`: the data signal units scale
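
The branching in steps 2 to 4 of the rundown can be sketched as a small decision function. The function name and the returned strings are descriptive labels invented for this sketch (only `downsample_and_load` is a name taken from the text), so read this as a summary of the rules rather than the loader's actual control flow.

```python
def plan_loading(needs_downsample, save_downsamp, mapped):
    """Return the ordered actions for steps 2-4 as descriptive labels."""
    steps = []
    if needs_downsample and (save_downsamp or mapped):
        # A downsample file is written if it should be saved or if the result stays mapped
        steps.append('create downsample file')
        steps.append('map downsample file')
    elif needs_downsample:
        # No file is wanted: downsample straight into memory
        steps.append('downsample_and_load')
    else:
        steps.append('map primary file')
    if not mapped and steps[-1].startswith('map'):
        # The final source should be in memory but is currently mapped
        steps.append('load mirror to memory')
    return steps


print(plan_loading(needs_downsample=True, save_downsamp=False, mapped=False))
# ['downsample_and_load']
```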

#### Step 1: channel mapping

This step analyzes the channel mapping information keyed by the `electrode` parameter to figure out which data channels belong to ECoG electrodes, grounded inputs, and reference electrodes. For the 256-channel electrode, every data channel is an electrode channel.

```python
# Build the channel map once and reuse the result
chan_info = loader.make_channel_map()
channel_map, electrode_chans, ground_chans, ref_chans = chan_info
print(list(map(type, chan_info)))
print(list(map(len, chan_info)))
```
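
The idea of sorting data channels by role can be sketched without the real pinout tables. The pinout format below (one entry per data channel, holding a site number, `'gnd'`, `'ref'`, or `None`) is hypothetical and is not the structure used by `ecogdata.devices.electrode_pinouts`.

```python
def partition_channels(pinout):
    """Split data-channel indices by the role assigned in a pinout list.

    `pinout` is a hypothetical format: one entry per data channel, holding an
    electrode site number, 'gnd', 'ref', or None for an unconnected channel.
    """
    electrode, ground, ref = [], [], []
    for chan, site in enumerate(pinout):
        if site == 'gnd':
            ground.append(chan)
        elif site == 'ref':
            ref.append(chan)
        elif site is not None:
            electrode.append(chan)
    return electrode, ground, ref


toy_pinout = [0, 1, 'gnd', 2, 'ref', 3, None]
toy_electrode, toy_ground, toy_ref = partition_channels(toy_pinout)
print(toy_electrode, toy_ground, toy_ref)  # [0, 1, 3, 5] [2] [4]
```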

#### Step 2: downsampling to file

This step involves calling `FileLoader.create_downsample_file`, which is fairly generalized. The `RHDLoader` first calls its super-class method, and then follows up by inserting the correct sampling rate information into the downsample source file. As mentioned before, the meat of this method is mirroring and then changing rate:
```python
downsamp = datasource.mirror(new_rate_ratio=downsamp_ratio, mapped=True, channel_compatible=True,
                             filename=downsamp_file)
datasource.batch_change_rate(downsamp_ratio, downsamp)
```
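
The mirror-then-fill pattern can be illustrated on a single channel with plain lists. This is a sketch under simplifying assumptions: it reduces the rate by averaging blocks of `ratio` samples, whereas the real `batch_change_rate` works on multichannel sources and may use a different anti-aliasing method. `batch_downsample` is a hypothetical name.

```python
def batch_downsample(samples, ratio, batch_size, out):
    """Fill `out` with block means of `samples`, reading `batch_size` samples at a time."""
    if batch_size % ratio:
        raise ValueError('batch_size must be a multiple of ratio')
    write_pos = 0
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # Average each complete group of `ratio` samples; a trailing partial group is dropped
        for i in range(0, len(batch) - ratio + 1, ratio):
            out[write_pos] = sum(batch[i:i + ratio]) / ratio
            write_pos += 1
    return out


raw = list(range(12))
ratio = 3
# The empty reduced-rate "mirror" is allocated first, then filled batch by batch
downsamp = [0.0] * (len(raw) // ratio)
batch_downsample(raw, ratio, batch_size=6, out=downsamp)
print(downsamp)  # [1.0, 4.0, 7.0, 10.0]
```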

#### Steps 3-4: Mapping & loading

Normally a primary source is mapped as a `MappedSource` as a precursor to loading, since the data channels get organized correctly and `MappedSource.mirror(mapped=False)` has nice abstractions for handling HDF5 data. An exception is downsampling without saving, which skips the costly creation of a downsample file and loads directly to memory. Writeable mapped mirrors are created (`MappedSource.mirror` with appropriate args) if filtering is needed, or if the loader was created with `mapped='r+'` to indicate "writeable no matter what".
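
The rule for when a writeable mirror is needed fits in a few lines. `needs_writeable_mirror` is a hypothetical helper that restates the rule above. It assumes that data loaded to memory never needs a mirror, and it treats any non-empty `bandpass` or `notches` argument as a request for filtering.

```python
def needs_writeable_mirror(mapped, bandpass=None, notches=None):
    """Return True if a mapped source must be mirrored to a writeable copy."""
    if not mapped:
        # Assumption: data loaded to memory can be modified in place
        return False
    filtering = bool(bandpass) or bool(notches)
    # mapped='r+' means "writeable no matter what"
    return filtering or mapped == 'r+'


print(needs_writeable_mirror(True))                      # False
print(needs_writeable_mirror(True, bandpass=(2, 100)))   # True
print(needs_writeable_mirror('r+'))                      # True
```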

#### Steps 5-6: filtering

Filtering is accomplished through `datasource.filter_array` and `datasource.notch_filter`, which are defined on any `ElectrodeSource`.
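
The `bandpass` argument uses -1 to mark an unused edge, which selects between band-pass, high-pass, and low-pass filtering. The helper below is a sketch of how such a pair can be interpreted. `band_type` is a hypothetical function, and treating every non-positive edge as "unused" is an assumption that goes beyond the documented -1 convention.

```python
def band_type(bandpass):
    """Classify a (lo, hi) pair of band edges; -1 marks an unused edge."""
    lo, hi = bandpass
    if lo > 0 and hi > 0:
        if lo >= hi:
            raise ValueError('low edge must be below high edge')
        return 'bandpass'
    if lo > 0:
        return 'highpass'
    if hi > 0:
        return 'lowpass'
    return 'none'


print(band_type((2, 100)))   # bandpass
print(band_type((2, -1)))    # highpass
print(band_type((-1, 100)))  # lowpass
```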

#### Step 7: trigger parsing

The basic scenario is that the row at index `FileLoader.trigger_idx` in `FileLoader.trigger_array` contains a rising-edge coded event signal, which is detected in `FileLoader.find_trigger_signals`. More complex cases (e.g. pseudo-demuxed BNC channels from the National Instruments DAQ) require overriding this logic in a subclass.
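
Rising-edge detection itself is simple to sketch on a logic-level signal. The function below is a minimal illustration with a fixed threshold. It is not the code in `find_trigger_signals`, and the threshold value and the toy signal are made up.

```python
def rising_edges(signal, threshold):
    """Return the sample indices where `signal` crosses `threshold` from below."""
    edges = []
    for i in range(1, len(signal)):
        if signal[i - 1] < threshold <= signal[i]:
            edges.append(i)
    return edges


trigger = [0, 0, 5, 5, 0, 0, 5, 0]
print(rising_edges(trigger, threshold=2.5))  # [2, 6]
```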