In [None]:
import os
import numpy as np
import h5py
from time import time
from ecogdata.expconfig import session_info, load_params, OVERRIDE
from ecogdata.devices.data_util import params_table
from ecogdata.devices.load.active_electrodes import ActiveLoader
from ecogdata.datasource.memmap import MemoryBlowOutError

In [None]:
%matplotlib inline

## Case study: active electrode mapping

To motivate the different requirements of mapping logic, we'll look at a recording from the SiO2 rat active electrode (Chiang, STM 2019). These recordings were made with National Instruments and saved in the TDMS file format, and the TDMS files were then converted to HDF5 via either Python or Matlab scripts. Both variations may be found, and they are interchangeable apart from a transpose in the "data" array.

**Get the session info for a recording from an implanted array.**

In [None]:
info = session_info('16011/2016-09-01_active', params_table=params_table)
daq = info.daq
headstage = info.headstage
electrode = info.electrode
bnc = info.bnc

In [None]:
# A transposed file to map
# f_path = '/Users/mike/experiment_data/Viventi 2016-09-01 Active Rat 16011/test_001.mat'
f_path = os.path.join(info.nwk_path, 'test_001.mat')
with h5py.File(f_path, 'r') as f:
    print('data shape:', f['data'].shape)

In [None]:
loader = ActiveLoader(info.nwk_path, 'test_001', electrode, daq, headstage, bnc=bnc, trigger_idx=0, mapped=True)
channel_map, electrode_channels, other_channels, ref_channels = loader.make_channel_map()

64 electrode data channels were stored within the first 80 channels, *but not contiguously*. (The missing channels included the leakage measurements.)

In [None]:
print(electrode_channels)

Each group of 8 channels corresponded to the same MUX, and was physically on the same electrode array column. But to make the mapping maximally complicated:

* physical rows are permuted within each MUX!
* physical columns are permuted across MUXes!

Note the row/column sequence of the first two MUXes (16 channels):

In [None]:
# print the ecog array (row, col) of each channel
print(list(zip(*channel_map.to_mat()))[:16])
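
The double permutation described above can be sketched with plain NumPy. This is only a toy model: the 8x8 geometry matches the text, but the permutations are randomly generated for illustration and are not the real device wiring, and names such as `col_of_mux` are hypothetical.

```python
import numpy as np

# Hypothetical 8x8 array read out through 8 MUXes of 8 channels each.
# The permutations below are made up; they are NOT the real device wiring.
rng = np.random.default_rng(0)
n_rows, n_cols = 8, 8
col_of_mux = rng.permutation(n_cols)  # physical column served by each MUX

rows, cols = [], []
for mux in range(n_cols):
    # rows are visited in a different (permuted) order within each MUX
    rows.extend(rng.permutation(n_rows).tolist())
    # every channel of this MUX sits on the same physical column
    cols.extend([int(col_of_mux[mux])] * n_rows)
rows, cols = np.array(rows), np.array(cols)

# Scatter a channel-ordered vector onto the physical grid
channel_values = np.arange(n_rows * n_cols)
grid = np.empty((n_rows, n_cols), dtype=int)
grid[rows, cols] = channel_values

# (row, col) of the first two MUXes, in channel order
print(list(zip(rows.tolist(), cols.tolist()))[:16])
```

Each (row, col) pair occurs exactly once, so the scatter fills the whole grid. A channel map has to carry exactly this kind of lookup.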

## Logic of MappedSource

The requirements for providing consistent array interaction with file-mapping electrode data are these:

* Expose a (mutable) subset of on-disk data that correspond to the ecog signal array
  + This set needs to be mutable to enable channel selection
* Expose slicing syntax: `output_array = a[1:3, 20:40]` and `a[1:3, 20:40] = input_array`
* Guard against loading too much data at a time, since the raw data might be massive
* Present signal matrix orientation consistently, irrespective of raw data transposes
* Map multiple files as if they were joined end on end (*only if every file has the same layout*)

For a guide to generic DataSource interaction, see the [data sources notebook](data_sources.ipynb).

In [None]:
# Map the source data file and create a dataset "bunch"
dataset = loader.create_dataset()
print(dataset)
mapped_source = dataset.data
print('Electrode signal shape:', mapped_source.shape)

### Electrode channels & active channel selection

The datasource for this dataset is a `MappedSource`. Note that its shape leads with 64 channels. This is the convention for all `ElectrodeDataSource` types.

The set of electrode channels informs the MappedSource which channels from the underlying "data_buffer" to expose. These are the channels we saw before:

In [None]:
print('MappedSource channels:', mapped_source._electrode_channels)
print()
print('Full data buffer shape:', mapped_source.data_buffer.shape)

<div class="alert alert-info">

**Note**: Attributes leading with an underscore (e.g. "_electrode_channels") are considered to be "private". There is no notion of private and public class members in Python, so this is only understood by convention. The idea is that these attributes or methods should not be used/abused unless you are quite certain about what you are doing.

</div>

The mapped source also has active channels, which are a subset of the electrode channels.

In [None]:
print('Active subset:', mapped_source._active_channels)

Being a "private" attribute, this set should not be manipulated directly. Instead, you can apply a channel mask in the form of a 64-channel binary array (the value of "False" means de-select a channel). Suppose there was a switching problem on the entire first electrode column (first in channel space):

In [None]:
# get current mask: all True by default
mask = mapped_source.binary_channel_mask
mask[:8] = False
mapped_source.set_channel_mask(mask)
print('New active subset:', mapped_source._active_channels)
print()
print('New shape:', mapped_source.shape)
print()
print('Buffer shape:', mapped_source.data_buffer.shape)

So the active channels now exclude the de-selected channels. The data buffer, of course, is not changed.
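
The bookkeeping behind a channel mask can be sketched with plain NumPy. This is a minimal illustration rather than the `MappedSource` implementation, and the channel indices here are made up.

```python
import numpy as np

# Hypothetical buffer indices of the electrode channels (made up for illustration)
electrode_channels = np.array([0, 1, 2, 3, 5, 6, 8, 9])

# One mask entry per electrode channel; False de-selects that channel
mask = np.ones(len(electrode_channels), dtype=bool)
mask[:2] = False

# The active channels are the electrode channels that survive the mask
active_channels = electrode_channels[mask]
print(active_channels)  # [2 3 5 6 8 9]
```

Only the index set changes. Nothing about the underlying buffer needs to be touched, which is why the mask can be reset later.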

In practice, the ChannelMap should also be synchronized:

In [None]:
sub_map = dataset.chan_map.subset(mask)
sub_map.image()

The channel mask can also be reset or unset on the same mapped source. For example, to unset the mask, provide an all-True mask (or the value `None`):

In [None]:
mask[:] = True
mapped_source.set_channel_mask(mask)
# This also works to reset:
mapped_source.set_channel_mask(None)
print('Shape:', mapped_source.shape)

### Array access

A `MappedSource` exposes read/write slicing with syntax that's (mostly) compatible with regular `numpy.ndarray` types. This access follows from a stack of logic abstractions, from bottom to top:

* `h5py.Dataset`: 
  + exposes slicing syntax on the underlying HDF5 file, mapping slice ranges to the correct blocks in the dataset B-tree.
* `HDF5Buffer` and `BufferBinder` types in `ecogdata.datasource.array_abstractions`: 
  + scatter-gather slicing optimized for HDF5 "chunks"
  + hand-off between "joined" mapped files (see also [joining datasets](joining_datasets.ipynb))
  + context dependent transpose
* `MappedSource`:
  + channel selection
  + memory-load checks
  
In this section, we'll focus on the top layer. The job of `MappedSource` is fairly simple: it translates a slice for the *current* signal array geometry into a slice for the underlying data buffer. The translation is handled by `_slice_logic`:

In [None]:
# slice from channel 5 to 15, from time 0 to 1000 (use the numpy.s_ object to construct slices)
slicer = np.s_[5:15, 0:1000]
print('Input slices:', slicer)
buffer_slice = mapped_source._slice_logic(slicer)
print('Buffer slices:', buffer_slice)

Two things have happened:

* slice(5, 15, None) has translated to a discontiguous set corresponding to `_active_channels[5:15]`
* the buffer slice is transposed, since the buffer shape is (time, channels)

Other general forms of slicing are also translated:

In [None]:
s = np.s_[[1, 4, 10], 10:20]
print('Input slice:', s, '---> Buffer slice:', mapped_source._slice_logic(s))
s = np.s_[30, 10:20]
print('Input slice:', s, '---> Buffer slice:', mapped_source._slice_logic(s))
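
The translation can be imitated with a small stand-alone function. This is a rough sketch assuming a transposed (time, channels) buffer. `translate_slice` is a hypothetical helper and not the library's `_slice_logic`, and the channel indices are made up.

```python
import numpy as np

def translate_slice(slicer, active_channels, transposed=True):
    """Toy translation of a (channel, time) slice into a buffer slice."""
    chan_sel, time_sel = slicer
    # Look up which buffer channels the requested signal channels refer to.
    # tolist() gives a list for several channels and a plain int for one channel.
    buffer_chans = np.asarray(active_channels)[chan_sel].tolist()
    if transposed:
        return (time_sel, buffer_chans)
    return (buffer_chans, time_sel)

# Hypothetical discontiguous active channels and a (time, channels) buffer
active_channels = [0, 1, 2, 3, 5, 6, 8, 9]
buffer = np.arange(200).reshape(20, 10)

buf_slice = translate_slice(np.s_[2:5, 0:4], active_channels)
print('Buffer slice:', buf_slice)
# Read from the buffer, then present the result as (channels, time)
out = buffer[buf_slice].T
print(out.shape)
```

The caller keeps thinking in (channels, time) coordinates of the active set, while the buffer sees only its own layout.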

The mapped source also guards against loading too much memory at a time, which is governed by the "memory_limit" in the global params file. For demonstration, we'll temporarily override this limit with a small value.

In [None]:
print('Normal memory limit in bytes:', load_params().memory_limit)
print('With 100 kB override...')
# calculate the smallest number of samples that exceeds the limit
max_samps = int(1e5 / mapped_source.dtype.itemsize / len(mapped_source)) + 1
buffer_slice = mapped_source._slice_logic(np.s_[:, :max_samps])
try:
    # set memory limit to 100 kB
    OVERRIDE['memory_limit'] = 1e5
    mapped_source._check_slice_size(buffer_slice)
except MemoryBlowOutError as e:
    print('Got a blow out error!')
    print('Error message', e)
finally:
    del OVERRIDE['memory_limit']

If you *really* know what you're doing, a big slice can be made using a context manager:

In [None]:
print('With 100 kB override and using context...')
try:
    # set memory limit to 100 kB
    OVERRIDE['memory_limit'] = 1e5
    with mapped_source.big_slices():
        # Any slices within this context block will not be checked for blow-out
        mapped_source._check_slice_size(buffer_slice)
        print('No error on big slice')
except MemoryBlowOutError as e:
    print('Got a blow out error!')
    print('Error message', e)
finally:
    del OVERRIDE['memory_limit']

Memory-check behavior can also be turned off for an object by setting `raise_on_big_slices=False` in the constructor.
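
The size check itself amounts to counting the bytes a slice would load. The following is a minimal sketch of that arithmetic, not the library's `_check_slice_size`. `SliceTooBig` and `check_slice_size` are hypothetical names, and the array shape is made up.

```python
import numpy as np

class SliceTooBig(MemoryError):
    """Hypothetical stand-in for MemoryBlowOutError."""

def check_slice_size(shape, slicer, itemsize, limit_bytes):
    # Resolve each index against the array shape to count the selected elements
    n_items = 1
    for dim, s in zip(shape, slicer):
        if isinstance(s, slice):
            n_items *= len(range(*s.indices(dim)))
        else:
            # an integer selects 1 element, a list selects len(list)
            n_items *= np.size(s)
    n_bytes = n_items * itemsize
    if n_bytes > limit_bytes:
        raise SliceTooBig('slice needs {} bytes, limit is {}'.format(n_bytes, limit_bytes))
    return n_bytes

# 64 channels of 8-byte samples with a 100 kB limit (assumed values)
shape, itemsize, limit = (64, 1000000), 8, 1e5
max_samps = int(limit / itemsize / shape[0]) + 1
try:
    check_slice_size(shape, np.s_[:, :max_samps], itemsize, limit)
except SliceTooBig as e:
    print('Too big:', e)
```

With these assumed numbers, 195 samples across 64 channels stay just under the limit and 196 samples exceed it, which mirrors the `max_samps` calculation used earlier.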

#### Subprocess data caching

Since slicing on a MappedSource might incur a significant time suck, slicing into a shared memory cache can happen in the background while the foreground process does other work. This is employed in data iteration (see [data sources notebook](data_sources.ipynb)) to allow the main process to work on the currently yielded block while the background process fills the next block's cache.

This is handled by `cache_slice()` and `get_cached_slice()`:

In [None]:
t1 = time()
# cache a big slice
mapped_source.cache_slice(np.s_[:, :50000])
t2 = time()
print('Time mark 1: {}'.format(t2 - t1))
print('Doin stuff....')
cached_slice = mapped_source.get_cached_slice()
t3 = time()
print('Total cache time: {}, time available for foreground process: {}'.format(t3 - t1, t3 - t2))
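
The overlap between background loading and foreground work can be illustrated with the standard library. Note the swap: this sketch uses a background *thread* from `concurrent.futures` rather than the subprocess and shared memory cache described above, and `slow_read` is a hypothetical stand-in for an expensive slice.

```python
from time import sleep, perf_counter
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def slow_read(n_samples):
    # Stand-in for a slow read from a mapped file
    sleep(0.2)
    return np.zeros((4, n_samples))

with ThreadPoolExecutor(max_workers=1) as pool:
    t1 = perf_counter()
    # Start the read in the background; this call returns immediately
    future = pool.submit(slow_read, 1000)
    t2 = perf_counter()
    # The foreground is free to do other work in the meantime
    foreground_result = sum(range(1000))
    # Block only when the data is actually needed
    cached = future.result()
    t3 = perf_counter()

print('Submit time: {:.4f} s, total time: {:.4f} s'.format(t2 - t1, t3 - t1))
```

The pattern is the same as with `cache_slice()` and `get_cached_slice()`: start the read, do useful work, then collect the result.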

#### Slicing a new MappedSource


Generally, slicing only across channels will return all samples on those channels.

In [None]:
data_ch0 = mapped_source[0:2]
print(type(data_ch0), data_ch0.shape)

However, it's also possible to yield a new mapped source for just those channels. This map will share a data buffer with the original map. The only difference is the `_electrode_channels` set.

In [None]:
with mapped_source.channels_are_maps():
    new_map = mapped_source[0:2]
print(type(new_map), new_map.shape)
print(new_map.data_buffer is mapped_source.data_buffer)

### Writeable sources and mirroring

Array-writing syntax is supported if a mapped source (and data_buffer) is writeable. The present mapped source is *not* writeable, since the file loader never maps a primary source file in read-write mode. Furthermore, since the array shape convention for data sources is (channels, time), a transposed map would *never* be writeable.

To get a writeable source, this source can be mirrored. For example, the primary source would have been mirrored by the loader before applying any filtering. Mirroring has a number of options:

* new_rate_ratio: the mirrored source can be prepared for downsampling if new_rate_ratio > 1
* writeable: the new data source will be writeable (True in most use cases)
* mapped: the new data source will be mapped -- otherwise get a `PlainArraySource`
* channel_compatible: this controls whether the file layout will be the same as the source file, or if the number of channels will only include the active channels
* filename: if a filename is not given, make a temporary file in the temporary file "pool"
* copy: you can copy 'all' signal and aligned arrays, only 'aligned' arrays, or nothing for just empty arrays of the correct shape (copy='')
* new_sources: if any arrays are pre-allocated for the new mirror, specify them using this parameter

To demonstrate a writeable map, we'll just mirror to a writeable map on a temporary file:

In [None]:
rw_source = mapped_source.mirror(writeable=True, mapped=True, copy='all')
print('Filename:', rw_source.data_buffer.filename)
print('Is writeable:', rw_source.writeable)

Two subtle things about the mirror:

1. It is now a direct map, since we did not choose to keep it channel compatible. That means the active channels map directly to the h5py.Dataset channels.
1. The data buffer is no longer transposed -- that unpleasantness was discreetly tidied up by the mirror method.

In [None]:
print('Is direct map:', rw_source.is_direct_map)
print('Map shape:', rw_source.shape, '<---> Buffer shape:', rw_source.data_buffer.shape)

In [None]:
# arbitrarily blank the 1st 100 samples on all channels
rw_source[:, :100] = 0
print(rw_source[:5, 98:101])
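
The idea of mirroring a read-only, transposed source into a writeable (channels, time) copy can be sketched with `numpy.memmap`. This is only an analogy under stated assumptions: the library mirrors into HDF5 files, whereas this sketch uses raw binary files, and the file names and array sizes are made up.

```python
import os
import tempfile
import numpy as np

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'source.dat')
mirror_path = os.path.join(tmp_dir, 'mirror.dat')

# Hypothetical source file stored as (time, channels) = (5, 4)
np.arange(20, dtype='f8').reshape(5, 4).tofile(src_path)

# Map the primary source read-only, as a loader would
ro = np.memmap(src_path, dtype='f8', mode='r', shape=(5, 4))
print('Source writeable:', ro.flags.writeable)

# Mirror into a new writeable file laid out as (channels, time)
rw = np.memmap(mirror_path, dtype='f8', mode='w+', shape=(4, 5))
rw[:] = ro.T

# Writes go to the mirror only; the source file is untouched
rw[:, :2] = 0
rw.flush()
print(rw[:, :3])
```

Keeping the primary file read-only and writing only to the mirror protects the original recording from accidental modification.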

## Data buffer: low-level array abstractions

*TODO*