# Joining datasets

This notebook demonstrates loading individual datasets and joining them into combined datasets.

In [None]:
from matplotlib import pyplot as plt

In [None]:
%matplotlib notebook

In [None]:
import numpy as np

from ecogdata.datasource import MappedSource, PlainArraySource
from ecogdata.devices.data_util import load_experiment_auto, join_datasets, load_datasets

from ecogdata.expconfig import available_sessions, session_info, session_conf

from ecogdata.devices.load.file2data import FileLoader

In [None]:
# Get an arbitrary set of tone-stimulated recordings from an awake recording session
session = available_sessions('16017')[1]
info = session_conf(session)

recordings = []
for key in info:
    if 'tones_tab' not in info[key]:
        continue
    if info[key].tones_tab.endswith('txt'):
        recordings.append(key)
print(recordings)

## Loading experiment recordings from the session config "database"

In [None]:
help(load_experiment_auto)

## Load arguments

The `load_experiment_auto` function delegates loading to data-wrangling code specific to each acquisition system (see the modules in `ecogdata.devices.load...`), and each system takes different arguments. Most loaders implement a form of the `ecogdata.devices.load.file2data.FileLoader` class (see its docstring, e.g. with `help(FileLoader)`). *This is an ongoing migration.*

Some load arguments are specified in `info.session`, and some may be overridden in the recording subsections.

Final priority is given to load arguments specified at runtime.
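
The priority order described above can be pictured as a layered dictionary merge. This is only a sketch of the idea, not the `ecogdata` implementation: the function `resolve_load_args` and the keys `bandpass` and `units` are hypothetical, and only `mapped` appears in this notebook.

```python
# Hypothetical stand-ins for the session-wide section and one recording
# subsection of a session config (keys and values are illustrative only).
session_section = {'mapped': False, 'bandpass': (2, 100), 'units': 'uV'}
recording_section = {'bandpass': (5, 100)}


def resolve_load_args(session_args, recording_args, **runtime_args):
    """Merge load arguments from lowest to highest priority."""
    merged = dict(session_args)
    merged.update(recording_args)  # recording subsection overrides the session section
    merged.update(runtime_args)    # runtime arguments override everything else
    return merged


load_args = resolve_load_args(session_section, recording_section, mapped='r+')
print(load_args)
```

Here `units` comes from the session section, `bandpass` from the recording subsection, and `mapped` from the runtime call.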

In [None]:
# Using mapped='r+' to ensure read-write access -- this will create a temp file.
dataset = load_experiment_auto(session, recordings[0], mapped='r+')

This dataset has mapped sources (primary is `data`). The other timeseries (`adc` and `aux`) are actually just references to the `aligned_arrays` from the primary `MappedSource`.

In [None]:
print(dataset)
print()
aligned_arrays = [k + ': ' + str(getattr(dataset.data, k)) for k in dataset.data.aligned_arrays]
print('Aligned arrays tracked by dataset.data:')
print('\n'.join(aligned_arrays))

## Simple datasource joining

Data sources (either mapped or loaded) can be joined with `source.join()`.

In [None]:
dataset2 = load_experiment_auto(session, recordings[1], mapped='r+')

The joined set is the simple concatenation of the two sets (with all the aligned arrays appended as well).

In [None]:
joined_dataset = dataset.data.join(dataset2.data)
print(dataset.data.shape, '+', dataset2.data.shape, '=', joined_dataset.shape)

Under the hood, the memory mapping for each of the two single-recording sources is a plain `HDF5Buffer`, which mediates reads from and writes to the mapped data file.

In [None]:
type(dataset.data.data_buffer)

The buffer for the joined dataset is a `BufferBinder`. This object does not create a new mapped file, but binds multiple source files together into a single source. It does this by managing the hand-off between sources when an index spans more than one of them.

In [None]:
type(joined_dataset.data_buffer)

In [None]:
plt.figure()
plt.plot(np.arange(200), dataset.data[0, -200:], label='last segment of source 1')
plt.plot(np.arange(200, 400), dataset2.data[0, :200], label='first segment of source 2')
t1 = dataset.data.shape[1]
# BufferBinder can slice through the two sets using hand-off
plt.plot(np.arange(400), joined_dataset[0, t1 - 200:t1 + 200] + 50, label='spanning segment joined set')
plt.legend()
plt.tight_layout()
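
The hand-off that a binder performs when a slice spans two sources can be sketched with a toy class over in-memory NumPy arrays. This is an illustration under simplifying assumptions (one channel per read, contiguous time ranges only); `ToyBinder` and its `read` method are hypothetical and are not the `BufferBinder` implementation.

```python
import numpy as np


class ToyBinder:
    """Present several (channels x time) arrays as one array along time.

    A toy illustration only, not the ecogdata BufferBinder.
    """

    def __init__(self, *sources):
        self.sources = sources
        lengths = [s.shape[1] for s in sources]
        # Boundaries of each source on the joined time axis
        self.edges = np.concatenate([[0], np.cumsum(lengths)])
        self.shape = (sources[0].shape[0], int(self.edges[-1]))

    def read(self, channel, start, stop):
        """Read joined samples [start, stop) from one channel."""
        pieces = []
        for source, lo, hi in zip(self.sources, self.edges[:-1], self.edges[1:]):
            # Overlap between the request and this source (joined coordinates)
            a, b = max(start, lo), min(stop, hi)
            if a < b:
                # Translate to this source's local coordinates before reading
                pieces.append(source[channel, a - lo:b - lo])
        return np.concatenate(pieces) if pieces else np.empty(0)


first = np.arange(10).reshape(2, 5)
second = np.arange(100, 106).reshape(2, 3)
binder = ToyBinder(first, second)
spanning = binder.read(0, 3, 7)  # crosses the boundary at joined sample 5
print(binder.shape, spanning)
```

No samples are copied into a new combined array up front; each read only touches the sources that overlap the requested range.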

## Joining recording datasets

Mapped data sources are easily joined, but metadata such as channel maps, sampling rate, and stimulation event timestamps need to be joined as well. Use `ecogdata.devices.data_util.join_datasets` for this.

In [None]:
help(join_datasets)
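
To see why event timestamps need more than plain concatenation, consider sample indices measured from the start of each recording: on a joined time axis, events from later recordings have to be shifted by the total length of the recordings before them. The sketch below shows that bookkeeping with NumPy. It is an assumption about the general technique, not a description of what `join_datasets` does internally, and `join_event_indices` is a hypothetical helper.

```python
import numpy as np


def join_event_indices(event_indices, recording_lengths):
    """Shift each recording's event sample indices onto a joined time axis."""
    # Offset of each recording = total number of samples in earlier recordings
    offsets = np.cumsum([0] + list(recording_lengths[:-1]))
    shifted = [np.asarray(ev) + off for ev, off in zip(event_indices, offsets)]
    return np.concatenate(shifted)


# Made-up event indices for two recordings of 1000 and 800 samples
events_1 = np.array([100, 400, 700])
events_2 = np.array([50, 350])
joined_events = join_event_indices([events_1, events_2], [1000, 800])
print(joined_events)
```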

### join_datasets() only combines channels that are unmasked in every recording

In [None]:
# Mask the first 10 channels of the first dataset to demonstrate channel map intersection
mask = dataset.data.binary_channel_mask
mask[:10] = False
dataset.data.set_channel_mask(mask)
dataset.chan_map = dataset.chan_map.subset(mask)
joined_set = join_datasets([dataset, dataset2])
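
The intersection idea can be shown with plain boolean arrays: a channel survives only if it is unmasked in every recording. This is a minimal sketch with made-up masks for a hypothetical 16-channel array, not the `ecogdata` code path.

```python
import numpy as np

# Made-up masks for a hypothetical 16-channel array (True = channel is kept)
mask_1 = np.ones(16, dtype=bool)
mask_1[:10] = False  # first recording: channels 0-9 masked out
mask_2 = np.ones(16, dtype=bool)
mask_2[12] = False   # second recording: channel 12 masked out

# A channel is kept only if it is unmasked in both recordings
common = mask_1 & mask_2
kept_channels = np.flatnonzero(common)
print(kept_channels)
```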

In [None]:
joined_set

Check the lengths of relevant arrays

In [None]:
shapes = [joined_set[attr].shape for attr in ['data', 'adc', 'aux', 'pos_edge']]
shapes.append(len(joined_set.exp))
print(shapes)

In [None]:
joined_set.chan_map.image()

## Use load_datasets to load multiple recordings at the same time

In [None]:
full_dataset = load_datasets(session, recordings[:4], load_kwargs=dict(mapped='r+'))

In [None]:
print(full_dataset)
print()
print('----- Other info -----')
print('Dataset name (joined names):', full_dataset.name)
print('Data shape:', full_dataset.data.shape)
print('Number of tones:', len(full_dataset.exp))

## Other join options

Load the joined set into memory from mapped sources.

In [None]:
# join_datasets may modify its inputs, so reset the channel mask on the second data source
dataset2.data.set_channel_mask(None)
joined_set = join_datasets([dataset, dataset2], source_type='loaded', popdata=False)

In [None]:
joined_set

In [None]:
shapes = [joined_set[attr].shape for attr in ['data', 'adc', 'aux', 'pos_edge']]
shapes.append(len(joined_set.exp))
print(shapes)

Join from a mixture of loaded and mapped sources and put the result into a `MappedSource`

In [None]:
dataset2 = load_experiment_auto(session, recordings[1], mapped=False)

In [None]:
joined_set = join_datasets([dataset, dataset2], source_type='mapped')
joined_set

In [None]:
shapes = [joined_set[attr].shape for attr in ['data', 'adc', 'aux', 'pos_edge']]
shapes.append(len(joined_set.exp))
print(shapes)

Join from a mixture of loaded and mapped sources and load the result into a `PlainArraySource`

In [None]:
dataset2 = load_experiment_auto(session, recordings[1], mapped=False)
joined_set = join_datasets([dataset, dataset2], source_type='loaded')
joined_set

In [None]:
shapes = [joined_set[attr].shape for attr in ['data', 'adc', 'aux', 'pos_edge']]
shapes.append(len(joined_set.exp))
print(shapes)