                                                                                                 MSC & JB 2022

# Producers

- [**Imports**](#Imports)
- [**Introduction**](#Introduction)
- [**Creation Routines**](#Creation-Routines)
    - [**Producers From Arrays**](#Producers-From-Arrays)
    - [**Producers From Sequences**](#Producers-From-Sequences)
    - [**Producers From Files**](#Producers-From-Files)
    - [**Producers From Generating Functions**](#Producers-From-Generating-Functions)
    - [**Producers From Producers**](#Producers-From-Producers)
    - [**Masked Producers**](#Masked-Producers)

## Imports

In [5]:
import sys
import numpy as np
import matplotlib.pyplot as plt

from openseize import producer
from openseize import demos
from openseize.io import edf

## Introduction

><font size=3> **Problem statement** <br/><br/> The size of an EEG dataset depends on three factors: the number of signals acquired, the sampling rate of each signal, and the duration of the measurement. Recent advances in electrode and data acquisition hardware allow each of these factors to increase, so the resulting dataset may not fit into the random-access memory (RAM) of a user's computer. </font>
>
><font size=3>To address this, openseize uses an iterable object -- the <font color='darkcyan'> **producer, an object that sequentially produces numpy arrays from a data source**</font>. This data source can be a sequence, an ndarray, a file stored on disk, or even a generator function that produces data itself. In this demo, we will cover producer creation routines, attributes, and methods.</font>
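
><font size=3>To get a feel for the problem, the three factors can be multiplied out directly. The numbers below (64 channels, 5 kHz, 24 hours, float64 samples) are hypothetical values chosen only for illustration; they are not taken from any openseize dataset.</font>

```python
# Hypothetical recording parameters, chosen only for illustration.
n_channels = 64          # number of signals acquired
sample_rate = 5000       # samples per second, per channel
duration_s = 24 * 3600   # a 24 hour recording, in seconds
bytes_per_sample = 8     # one float64 value

# samples per channel, then total bytes if everything were loaded at once
n_samples = sample_rate * duration_s
total_bytes = n_channels * n_samples * bytes_per_sample

print('samples per channel = {}'.format(n_samples))
print('in-memory size = {:.1f} GB'.format(total_bytes / 1e9))
```

><font size=3>Under these assumptions the array would need roughly 221 GB, far more than a typical workstation's RAM.</font>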

## Creation Routines

> <font size=3> All producers, no matter the data source, are constructed using the producer() function. To see what arguments are needed to build a producer, we can look at the producer function's documentation.</font>

In [3]:
help(producer)

Help on function producer in module openseize.core.producer:

producer(data, chunksize, axis, shape=None, mask=None, **kwargs)
    Constructs an iterable that produces ndarrays of length chunksize
    along axis during iteration.
    
    This constructor returns an object that is capable of producing ndarrays
    or masked ndarrays during iteration from a single ndarray, a sequence of
    ndarrays, a file Reader instance (see io.bases.Reader), an ndarray 
    generating function, or a pre-existing producer of ndarrays. The 
    produced ndarrays from this object will have length chunksize along axis.
    
    Args:
        data:
            An object from which ndarrays will be produced from. Supported
            types are Reader instances, ndarrays, sequence of ndarrays, 
            generating functions yielding ndarrays, or a producer of 
            ndarrays. For sequences and generator functions it is
            required that each subarray has the same shape along all axes 
   

> <font size=3>So the producer function needs a <font color='darkcyan'>data</font> source, a <font color='darkcyan'>chunksize</font> describing the number of samples that should be included in each produced subarray, the <font color='darkcyan'>axis</font> along which samples lie and <font color='darkcyan'>possibly a shape and mask</font>. We will cover each of these parameters in detail in this demo.</font>
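
> <font size=3>To make the roles of chunksize and axis concrete, here is a minimal sketch of slicing an array into pieces along one axis. The function name chunk_along_axis is hypothetical, and this is not openseize's implementation; it only illustrates what "ndarrays of length chunksize along axis" means.</font>

```python
import numpy as np

def chunk_along_axis(arr, chunksize, axis=-1):
    """Yield consecutive slices of arr with at most chunksize samples along axis."""
    n = arr.shape[axis]
    for start in range(0, n, chunksize):
        # take everything on every axis except `axis`, which gets a window
        index = [slice(None)] * arr.ndim
        index[axis] = slice(start, start + chunksize)
        yield arr[tuple(index)]

# a toy array with 2 channels and 10 samples along the last axis
toy = np.arange(20).reshape(2, 10)
shapes = [chunk.shape for chunk in chunk_along_axis(toy, chunksize=4, axis=-1)]
print(shapes)
```

> <font size=3>The channel axis keeps its full length in every piece, while the sample axis is cut into windows of 4, 4, and finally 2 samples.</font>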

### Producers From Arrays

> <font size=3>To create a producer from an array may seem silly. *Isn't the array already in memory?* Well, yes it is but maybe that array is consuming a lot of your memory and you can't do anything with the array. By creating a producer, you can work with the produced values using any of the openseize functions (downsample, filter, etc) while still holding the array in-memory. <br/><br/> **Let's make an array and then create a producer to demonstrate this utility.**</font>

In [13]:
# create a reproducible random data array with 4 channels and 1 million samples along axis=1
rng = np.random.default_rng(1234)
data = rng.random((4, 1000000))

# let's also print data's memory consumption
print('data is using = {} MB'.format((data.size * data.itemsize)/1e6))

data is using = 32.0 MB


In [27]:
# build a producer declaring that we want the producer to yield arrays of size 300000 
# using the samples along the last axis
pro = producer(data, chunksize=300000, axis=-1)

# let's check out the producer's memory consumption
print('pro is using = {} Bytes'.format(sys.getsizeof(pro)))

pro is using = 48 Bytes
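
><font size=3>One caveat when reading this number: sys.getsizeof reports only the size of the object itself, not of the objects it references. The sketch below uses a hypothetical Holder class (not part of openseize) to show that an object holding a reference to a large container still reports a tiny size.</font>

```python
import sys

class Holder:
    """Hypothetical stand-in for an object that only references its data."""

    def __init__(self, data):
        self.data = data   # a reference, not a copy

big = list(range(1_000_000))
holder = Holder(big)

holder_size = sys.getsizeof(holder)   # the Holder instance only
list_size = sys.getsizeof(big)        # the list object, excluding its int elements

print('holder = {} bytes, list = {} bytes'.format(holder_size, list_size))
print('holder.data is big -> {}'.format(holder.data is big))
```

><font size=3>So the small size tells us the producer is a thin wrapper; when the source is an in-memory array, the array itself still occupies its original memory.</font>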


><font size=3>This is the first important point about producers. <font color='darkcyan'>Producers do not store data; they are iterables that know how to yield data to you on-the-fly.</font> **Let's see what the producer's attributes are.**</font>

In [28]:
print(pro)

ArrayProducer Object
---Attributes & Properties---
{'data': array([[0.97669977, 0.38019574, 0.92324623, ..., 0.02049864, 0.84033509,
        0.07061386],
       [0.32584251, 0.01559622, 0.16734471, ..., 0.48613722, 0.13466647,
        0.78129557],
       [0.45169665, 0.44011763, 0.0325013 , ..., 0.86914401, 0.5904367 ,
        0.4616979 ],
       [0.84830865, 0.97995714, 0.63405179, ..., 0.7236714 , 0.80536627,
        0.77495984]]),
 'axis': -1,
 'kwargs': {},
 'chunksize': 300000,
 'shape': (4, 1000000)}

Type help(ArrayProducer) for full documentation


><font size=3>The producer instance is holding a reference to the data array, the sample axis, the chunksize of the subarrays that will be produced, and the shape of the referenced data. Let's try to get each subarray from the producer. <font color='firebrick'>*Wait... how do we do that?*</font>
    <br/>
    <br/>Since the producer is an iterable, you can access each subarray just as you would with any iterable. Anything that triggers Python's iteration protocol will give you the subarrays in the producer. This could be a for-loop, a list comprehension, or an explicit call to the iter and next builtin functions. **Let's access each produced array in a loop.** </font>

In [30]:
# loop to access each subarray, printing its shape and the first 5 samples of each channel
for idx, subarr in enumerate(pro):
    print('Array {}, shape={}'.format(idx, subarr.shape))
    print(subarr[:, :5])

Array 0, shape=(4, 300000)
[[0.97669977 0.38019574 0.92324623 0.26169242 0.31909706]
 [0.32584251 0.01559622 0.16734471 0.12427613 0.25749222]
 [0.45169665 0.44011763 0.0325013  0.02906749 0.20707769]
 [0.84830865 0.97995714 0.63405179 0.71921724 0.34165105]]
Array 1, shape=(4, 300000)
[[0.19975295 0.38469445 0.31663237 0.32026263 0.85713905]
 [0.68094421 0.67678136 0.02969927 0.90235448 0.79731081]
 [0.3700237  0.60763138 0.04216831 0.57699506 0.04456521]
 [0.54071085 0.82855925 0.09775676 0.03968656 0.65453465]]
Array 2, shape=(4, 300000)
[[0.92858655 0.05528663 0.88124263 0.28606888 0.54164412]
 [0.95592965 0.80143229 0.09263899 0.72895997 0.85988591]
 [0.7104101  0.58855675 0.11348623 0.5171883  0.90972664]
 [0.48743344 0.00490091 0.20384552 0.91139126 0.04721849]]
Array 3, shape=(4, 100000)
[[0.78483056 0.93115015 0.41382943 0.38030702 0.75412888]
 [0.4725766  0.14425412 0.15515715 0.71459954 0.30351422]
 [0.34821652 0.89459182 0.1399783  0.21133067 0.58058115]
 [0.78146378 0.0234

><font size=3>Be sure not to miss that the last array the producer yielded was smaller than the previous three. Why? Remember that the data shape is (4, 1e6), and 1e6 is not evenly divisible by 300,000. The last array yielded is 1e6 % 300,000 = 100,000 samples long. An important question: **Is the producer exhausted?**</font>
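
><font size=3>As a quick aside, the chunk count and the length of the short final chunk follow from integer division. This is plain arithmetic on the sizes used in this demo, not a call into openseize.</font>

```python
n_samples = 1_000_000
chunksize = 300_000

# number of full chunks and the samples left over for the final chunk
full_chunks, remainder = divmod(n_samples, chunksize)
n_chunks = full_chunks + (1 if remainder else 0)

print('full chunks = {}, last chunk = {} samples, total chunks = {}'.format(
    full_chunks, remainder, n_chunks))
```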

In [31]:
# test if producer can produce again
for idx, subarr in enumerate(pro):
    print('Array {}, shape={}'.format(idx, subarr.shape))

Array 0, shape=(4, 300000)
Array 1, shape=(4, 300000)
Array 2, shape=(4, 300000)
Array 3, shape=(4, 100000)
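
><font size=3>The second loop works because of how Python's iteration protocol treats iterables. The sketch below uses a hypothetical ToyProducer class over a plain list (it is not openseize's producer) to show explicit iter and next calls, repeated passes, and the contrast with a one-shot generator.</font>

```python
class ToyProducer:
    """Hypothetical stand-in for a producer: an iterable over chunks of a list."""

    def __init__(self, data, chunksize):
        self.data = data
        self.chunksize = chunksize

    def __iter__(self):
        # a new generator (iterator) is created on every call, so each pass
        # starts again from the first sample
        for start in range(0, len(self.data), self.chunksize):
            yield self.data[start:start + self.chunksize]

toy = ToyProducer(list(range(7)), chunksize=3)

# explicit use of the iteration protocol
it = iter(toy)
first = next(it)
second = next(it)

# the iterable can be traversed repeatedly
pass_one = list(toy)
pass_two = list(toy)

# a bare generator, by contrast, is a one-shot iterator
one_shot = (x for x in range(3))
consumed = list(one_shot)
leftover = list(one_shot)

print(first, second)
print(pass_one == pass_two)
print(consumed, leftover)
```

><font size=3>Each call to iter on the toy object starts a fresh pass, while the generator expression yields nothing the second time it is consumed.</font>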


><font size=3>This is critical: <font color='darkcyan'>the producer is an iterable, not a one-shot iterator. It can go through the data as many times as you need.</font> 
    <br/><br/> Now if you are skeptical (like any good scientist), you are probably wondering: *How do I know that the produced values **exactly** match the original data source?* **Let's demonstrate that all the produced data exactly matches the original data source.**</font>

In [35]:
# demonstrate that the produced arrays match the original data source 'data'
# concatenate all produced subarrays along the last (sample) axis.
produced_array = np.concatenate([subarr for subarr in pro], axis=1)

# now test if the combined produced arrays match the original data array
print('Fingers crossed.. Do they match? -> {}'.format(np.allclose(produced_array, data)))

Fingers crossed.. Do they match? -> True


> <font size=3> Our method of testing array equality required us to concatenate the produced arrays. Since converting a producer to an ndarray is likely something you'll need often, each producer instance provides a method for it called *to_array*. **Let's call this important method and repeat our test.**</font>

In [36]:
# demonstrate that the produced arrays match the original data source 'data'
# concatenate all produced subarrays along the sample axis using the producer's to_array method.
produced_array = pro.to_array()

# now test if the combined produced arrays match the original data array
print('Match? -> {}'.format(np.allclose(produced_array, data)))

Match? -> True


><font size=3>Of course this was just one test. If you need to see more tests to be convinced, please see openseize.core.tests.producer_tests for the formal pytesting.</font>
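
><font size=3>Many computations never need the whole array at once. As a sketch of working with produced values, the example below computes a per-channel mean one chunk at a time. The chunks helper is a hypothetical stand-in for a producer, and the array is kept small so the result can be checked against numpy directly.</font>

```python
import numpy as np

def chunks(arr, chunksize):
    """Hypothetical helper: yield slices of arr along its last axis."""
    for start in range(0, arr.shape[-1], chunksize):
        yield arr[..., start:start + chunksize]

rng = np.random.default_rng(1234)
data = rng.random((4, 10_000))

# accumulate per-channel sums and the sample count one chunk at a time
total = np.zeros(data.shape[0])
count = 0
for subarr in chunks(data, chunksize=3_000):
    total += subarr.sum(axis=-1)
    count += subarr.shape[-1]

chunked_mean = total / count
print(np.allclose(chunked_mean, data.mean(axis=-1)))
```

><font size=3>Only the running sums and the current chunk are needed at any moment, which is the pattern that lets chunked processing scale to sources that do not fit in memory.</font>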

### Producers From Sequences

><font size=3>As you might guess from our discussion on producers built from arrays, producers can be built from any sequence. **Let's show that producers can be built from sequences.**</font>

In [7]:
# make a fun sequence from Monty Python
my_seq = ["(Knight) Tis but a scratch.",
          "(Arthur) A scratch? Your arm's off!",
          "(Knight) No, it isn't.",
          "(Arthur) Well, what's that then?"]
# convert it to a scene producer
scene = producer(my_seq, chunksize=1, axis=-1)

In [9]:
# play the scene out
for dialog in scene:
    print(dialog)

['(Knight) Tis but a scratch.']
["(Arthur) A scratch? Your arm's off!"]
["(Knight) No, it isn't."]
["(Arthur) Well, what's that then?"]


><font size=3>There is no restriction on the datatype that can be produced as the above snippet demonstrates. As such, you might find producers useful for other large tasks that you need to break into subproblems.</font>

## Producers From Files

><font size=3>Producing data from a file stored on disk that is too large to fit into memory is one of the most important use cases for producers. Here we are going to open a European Data Format (EDF/EDF+) binary file and produce arrays from it. A detailed demo of this important file reader can be found in the file_reading demo.</font>

### Demo Data

><font size=3>In order to produce from a file, we will need demo data. Openseize includes a sample EDF file called *recording_001.edf*. <font color='darkcyan'>Where is this file?</font></font>
>
><font size=3>When we imported the demos module, we got a paths object that has two methods. The <font color='firebrick'>*available*</font> method lists all the datasets available in the local demos/data directory as well as the demo data available in a remote Zenodo repository. The <font color='firebrick'>*locate*</font> method returns a local filepath to a named dataset. To do this, locate may first need to download the data if the file is not already on your system.</font> 

In [3]:
# check out the available demo data included with openseize
# this will include any local data in demos/data and remote data stored at openseize's Zenodo repository.
demos.paths.available

---Available demo data files & location---
------------------------------------------
alignments.pkl                 'https://zenodo.or...5ad/alignments.pkl'
behavior_df.pkl                'https://zenodo.or...ad/behavior_df.pkl'
correlated_pairs_df.pkl        'https://zenodo.or...lated_pairs_df.pkl'
dredd_behavior.pkl             'https://zenodo.or...dredd_behavior.pkl'
dredd_behavior.xlsx            'https://zenodo.or...redd_behavior.xlsx'
dredd_freezes_df.pkl           'https://zenodo.or...edd_freezes_df.pkl'
high_degree_df.pkl             'https://zenodo.or...high_degree_df.pkl'
N006_wt_basis.npz              'https://zenodo.or.../N006_wt_basis.npz'
N006_wt_cxtbasis.pkl           'https://zenodo.or...06_wt_cxtbasis.pkl'
N006_wt_cxtsources.pkl         'https://zenodo.or..._wt_cxtsources.pkl'
N006_wt_rois.pkl               'https://zenodo.or...d/N006_wt_rois.pkl'
N006_wt_sources.npy            'https://zenodo.or...006_wt_sources.npy'
N019_wt_basis.npz              'https://zenodo.or.

In [4]:
# so we see the recording_001.edf is on Zenodo but not in our local data
# this should open a confirmation box and start your download
recording_path = demos.paths.locate('recording_001.edf')

### Building a data Reader

><font size=3>Ok, so we have a path to a data file. <font color='darkcyan'>So can we make a producer using this path?</font> Not directly. Openseize is a highly extensible package that will support many different file types, and each of these file types needs to be read according to its own protocol. For reading EDF files, openseize has an EDF reader that reads... well, EDF files. We discuss file readers in the file_reading demo. For now, we will make the EDF reader and explain the steps as we go along.</font>

In [7]:
# how do we build an edf reader -- ask for help!
help(edf.Reader)

Help on class Reader in module openseize.io.edf:

class Reader(openseize.io.bases.Reader)
 |  Reader(path)
 |  
 |  A reader of European Data Format (EDF/EDF+) files.
 |  
 |  The EDF specification has a header section followed by data records
 |  Each data record contains all signals stored sequentially. EDF+
 |  files include an annotation signal within each data record. To
 |  distinguish these signals we refer to data containing signals as
 |  channels and annotation signals as annotation. Currently, this reader
 |  does not support the reading of annotation signals.
 |  
 |  For details on the EDF/+ file specification please see:
 |  
 |  https://www.edfplus.info/specs/index.html
 |  
 |  Attributes:
 |      header: A dictionary representation of an EDF Header.
 |      shape: A tuple of channels, samples contained in this EDF
 |  
 |  Method resolution order:
 |      Reader
 |      openseize.io.bases.Reader
 |      abc.ABC
 |      openseize.core.mixins.ViewInstance
 |      builtin

><font size=3>Under the methods section, we see that all we need to make a reader is an EDF path.</font>

In [8]:
# build the reader using the recording path we fetched
reader = edf.Reader(recording_path)

><font size=3>Also notice that under the methods there is just one method, called *read*. It reads samples from the EDF file between the start and stop sample indices for each channel in the channels list. Let's see if this does something.</font>

In [11]:
values = reader.read(start=0, stop=10) #channels unspecified means read all channels
print(values)

[[-19.87908032   7.95793213  19.88808032  18.89390131  18.89390131
   43.74837671   6.96375311  30.8240495    5.9695741   22.87061737]
 [-86.4890744   51.70180884  63.63195703  88.48643243  63.63195703
   61.643599    54.68434589  43.74837671  55.6785249   64.62613605]
 [-85.49489539  44.74255573  29.82987048  79.53882129  52.69598785
   42.75419769  42.75419769  21.87643835  60.64941998  66.61449408]
 [ 62.63777802  95.44568555  77.55046326  36.7891236  109.36419177
  118.31180292 122.28851898 115.32926587  76.55628424  44.74255573]]


><font size=3>So we got an array of shape (4,10) back. <font color='darkcyan'>***Is that right?***</font></font>

In [12]:
# examine the reader with print
print(reader)

Reader Object
---Attributes & Properties---
{'path': PosixPath('/home/matt/python/nri/openseize/demos/data/recording_001.edf'),
 'header': {'version': '0',
            'patient': 'PIN-42 M 11-MAR-1952 Animal',
            'recording': 'Startdate 15-AUG-2020 X X X',
            'start_date': '15.08.20',
            'start_time': '09.59.15',
            'header_bytes': 1536,
            'reserved_0': 'EDF+C',
            'num_records': 3775,
            'record_duration': 1.0,
            'num_signals': 5,
            'names': ['EEG EEG_1_SA-B', 'EEG EEG_2_SA-B', 'EEG EEG_3_SA-B',
                      'EEG EEG_4_SA-B', 'EDF Annotations'],
            'transducers': ['8401 HS:15279', '8401 HS:15279', '8401 HS:15279',
                            '8401 HS:15279', ''],
            'physical_dim': ['uV', 'uV', 'uV', 'uV', ''],
            'physical_min': [-8144.31, -8144.31, -8144.31, -8144.31, -1.0],
            'physical_max': [8144.319, 8144.319, 8144.319, 8144.319, 1.0],
            'dig

><font size=3>Awesome! Printing the reader gives us the EDF's header, which tells us everything we need to know. The file contains 4 channels named [EEG EEG_1_SA-B,... EEG EEG_4_SA-B], so our array shape makes sense.</font>

><font size=3><font color='darkcyan'>**Can we finally build a producer? YES**</font>, producers can produce from any reader type as long as the reader has a method called <font color='firebrick'>**read**</font>. Lucky for us, the EDF reader we just built has a read method.</font>
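
><font size=3>To see why a read method is all that is required, here is a minimal sketch under stated assumptions. ToyReader and produce_from_reader are hypothetical names, the reader serves samples from an in-memory array rather than a file, and this is not how openseize implements its reader-backed producer.</font>

```python
import numpy as np

class ToyReader:
    """Hypothetical reader: serves samples from an in-memory array via read()."""

    def __init__(self, array):
        self._array = array
        self.shape = array.shape   # (channels, samples)

    def read(self, start, stop):
        # return the samples in [start, stop) for all channels
        return self._array[:, start:stop]

def produce_from_reader(reader, chunksize):
    """Yield arrays of at most chunksize samples by calling reader.read repeatedly."""
    n_samples = reader.shape[-1]
    for start in range(0, n_samples, chunksize):
        yield reader.read(start, min(start + chunksize, n_samples))

source = np.arange(40).reshape(4, 10)
toy_reader = ToyReader(source)

chunk_shapes = [chunk.shape for chunk in produce_from_reader(toy_reader, chunksize=4)]
rebuilt = np.concatenate(list(produce_from_reader(toy_reader, chunksize=4)), axis=-1)

print(chunk_shapes)
print(np.array_equal(rebuilt, source))
```

><font size=3>Because each chunk is fetched by a separate read call, only one chunk needs to be in memory at a time; a file-backed reader could serve the same role without ever loading the full recording.</font>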