## Imports

In [1]:
import sys
import numpy as np
import matplotlib.pyplot as plt

from openseize import producer
from openseize.core.producer import as_producer
from openseize import demos
from openseize.file_io import edf, annotations

## Introduction

<font size=3>The size of an EEG dataset depends on three factors: the number of signals acquired, the sampling rate of each signal, and the duration of the measurement. Recent advances in electrode and data acquisition hardware allow for increases in each of these factors, such that the resulting dataset may not fit into the memory (RAM) of a user's computer.</font>

<font size=3>To address this, openseize uses an iterable object -- the <font color='darkcyan'> <b>producer, an object that sequentially produces numpy arrays from a data source</b></font>. This data source can be a sequence, an ndarray, a file stored to disk, or even a generator function that produces data itself. In this demo, we will cover producer creation routines, attributes, and methods.

## Creation Routines

<font size=3> All producers, no matter the data source, are constructed using the producer() function. To see what arguments are needed to build a producer, we can look at the producer function's documentation.</font>

In [2]:
help(producer)

Help on function producer in module openseize.core.producer:

producer(data: Union[numpy.ndarray[Any, numpy.dtype[+ScalarType]], Iterable[numpy.ndarray[Any, numpy.dtype[+ScalarType]]], openseize.file_io.edf.Reader, Callable, ForwardRef('Producer')], chunksize: int, axis: int, shape: Optional[Sequence[int]] = None, mask: Optional[numpy.ndarray[Any, numpy.dtype[numpy.bool_]]] = None, **kwargs) -> 'Producer'
    Constructs an iterable that produces ndarrays of length chunksize
    along axis during iteration.
    
    This constructor returns an object that is capable of producing ndarrays
    or masked ndarrays during iteration from a single ndarray, a sequence of
    ndarrays, a file Reader instance (see io.bases.Reader), an ndarray
    generating function, or a pre-existing producer of ndarrays. The
    produced ndarrays from this object will have length chunksize along axis.
    
    Args:
        data:
            An object from which ndarrays will be produced from. Supported
       

<font size=3>So the producer function needs a <font color='darkcyan'>data</font> source, a <font color='darkcyan'>chunksize</font> describing the number of samples that should be included in each produced subarray, the <font color='darkcyan'>axis</font> along which samples lie, and <font color='darkcyan'>possibly a shape and mask</font>. We will cover each of these parameters in detail in this demo.

### Producers From Arrays

<font size=3>Creating a producer from an array may seem silly. <i>Isn't the array already in memory?</i> Well, yes it is, but that array may be consuming so much of your memory that there is little room left to work with it. By creating a producer, you can process the array chunk by chunk using any of the openseize functions (downsample, filter, etc.) while still holding the array in memory. <br/><br/> <b>Let's make an array and then create a producer to demonstrate this utility.</b></font>

In [3]:
# create a reproducible random data array with 4 channels and 1 million samples along axis=1
rng = np.random.default_rng(1234)
data = rng.random((4, 1000000))

# lets also print data's memory consumption
print('data is using = {} MB'.format((data.size * data.itemsize)/1e6))

data is using = 32.0 MB


In [4]:
# build a producer declaring that we want the producer to yield arrays of size 300000 
# using the samples along the last axis
pro = producer(data, chunksize=300000, axis=-1)

# let's check the producer's own memory footprint
# (sys.getsizeof reports the shallow size of the object, not the array it references)
print('pro is using = {} Bytes'.format(sys.getsizeof(pro)))

pro is using = 48 Bytes
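
<font size=3>A note on this measurement: sys.getsizeof reports only the shallow size of an object and does not count anything the object references. The sketch below uses a hypothetical Holder class (a stand-in for illustration, not an openseize class) to show that an object referencing a 32 MB array still reports a tiny size, because the array lives elsewhere in memory and is not copied.</font>

```python
import sys
import numpy as np


class Holder:
    """Hypothetical stand-in: keeps a reference to an array, never copies it."""

    def __init__(self, data):
        self.data = data


arr = np.zeros((4, 1000000))
holder = Holder(arr)

# shallow size of the holder object versus the bytes used by the array itself
shallow_bytes = sys.getsizeof(holder)
array_bytes = arr.nbytes

# the holder's attribute is the very same array object, not a copy
shares_memory = holder.data is arr

print(shallow_bytes, array_bytes, shares_memory)
```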


<font size=3>This is the first important point about producers. <font color='darkcyan'><b>Producers do not store data; they are iterables that know how to yield data to you on-the-fly.</b></font> Let's see what the producer's attributes are.</font>

In [5]:
print(pro)

ArrayProducer Object
---Attributes & Properties---
{'data': array([[0.97669977, 0.38019574, 0.92324623, ..., 0.02049864, 0.84033509,
        0.07061386],
       [0.32584251, 0.01559622, 0.16734471, ..., 0.48613722, 0.13466647,
        0.78129557],
       [0.45169665, 0.44011763, 0.0325013 , ..., 0.86914401, 0.5904367 ,
        0.4616979 ],
       [0.84830865, 0.97995714, 0.63405179, ..., 0.7236714 , 0.80536627,
        0.77495984]]),
 'axis': -1,
 'kwargs': {},
 'chunksize': 300000,
 'shape': (4, 1000000)}

Type help(ArrayProducer) for full documentation


<font size=3>The producer instance holds a reference to the data array, the sample axis, the chunksize of the subarrays that will be produced, and the shape of the referenced data. Let's try to get each subarray from the producer. <font color='firebrick'><i>Wait.. how do we do that?</i></font>
<br>
<br>
Since the producer is an iterable, you can access each subarray just as you would with any iterable. Anything that triggers Python's iteration protocol will give you the subarrays in the producer. This could be a for-loop, a list comprehension, or an explicit call to the iter and next builtin functions. <b>Let's access each produced array in a loop.</b></font>
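
<font size=3>Before running this on the producer, here is a minimal sketch of the three ways to trigger iteration. It uses a hypothetical Chunks class written only for this illustration (it is not how openseize implements producers), so it runs without any data files.</font>

```python
import numpy as np


class Chunks:
    """Hypothetical re-iterable that yields column chunks of a 2-D array."""

    def __init__(self, data, chunksize):
        self.data = data
        self.chunksize = chunksize

    def __iter__(self):
        # a fresh generator is created on every call, so iteration can restart
        for start in range(0, self.data.shape[1], self.chunksize):
            yield self.data[:, start:start + self.chunksize]


chunks = Chunks(np.arange(20).reshape(2, 10), chunksize=4)

# 1. a for-loop
loop_shapes = []
for sub in chunks:
    loop_shapes.append(sub.shape)

# 2. a list comprehension
comp_shapes = [sub.shape for sub in chunks]

# 3. explicit calls to the iter and next builtins
it = iter(chunks)
first = next(it)

print(loop_shapes)  # [(2, 4), (2, 4), (2, 2)]
```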

In [6]:
# loop to access each subarray, printing its shape and the first 5 samples of each channel
for idx, subarr in enumerate(pro):
    print('Array {}, shape={}'.format(idx, subarr.shape))
    print(subarr[:, :5])

Array 0, shape=(4, 300000)
[[0.97669977 0.38019574 0.92324623 0.26169242 0.31909706]
 [0.32584251 0.01559622 0.16734471 0.12427613 0.25749222]
 [0.45169665 0.44011763 0.0325013  0.02906749 0.20707769]
 [0.84830865 0.97995714 0.63405179 0.71921724 0.34165105]]
Array 1, shape=(4, 300000)
[[0.19975295 0.38469445 0.31663237 0.32026263 0.85713905]
 [0.68094421 0.67678136 0.02969927 0.90235448 0.79731081]
 [0.3700237  0.60763138 0.04216831 0.57699506 0.04456521]
 [0.54071085 0.82855925 0.09775676 0.03968656 0.65453465]]
Array 2, shape=(4, 300000)
[[0.92858655 0.05528663 0.88124263 0.28606888 0.54164412]
 [0.95592965 0.80143229 0.09263899 0.72895997 0.85988591]
 [0.7104101  0.58855675 0.11348623 0.5171883  0.90972664]
 [0.48743344 0.00490091 0.20384552 0.91139126 0.04721849]]
Array 3, shape=(4, 100000)
[[0.78483056 0.93115015 0.41382943 0.38030702 0.75412888]
 [0.4725766  0.14425412 0.15515715 0.71459954 0.30351422]
 [0.34821652 0.89459182 0.1399783  0.21133067 0.58058115]
 [0.78146378 0.0234

<font size=3>Be sure not to miss that the last array the producer yielded was smaller than the previous 3. Why? Remember the data shape is (4, 1e6) and 1e6 is not evenly divisible by 300,000. The last array yielded is 1e6 % 300,000 = 100,000 samples long. An important question: <b>Is the producer exhausted?</b></font>

In [7]:
# test if producer can produce again
for idx, subarr in enumerate(pro):
    print('Array {}, shape={}'.format(idx, subarr.shape))

Array 0, shape=(4, 300000)
Array 1, shape=(4, 300000)
Array 2, shape=(4, 300000)
Array 3, shape=(4, 100000)
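
<font size=3>The four chunk shapes printed by both loops follow from integer division. As a quick check using only builtin Python, divmod gives the number of full chunks and the size of the final partial chunk in one step.</font>

```python
samples = 1000000
chunksize = 300000

# number of full chunks and the leftover samples in the last chunk
full_chunks, remainder = divmod(samples, chunksize)

# a partial chunk adds one more array to the total
n_chunks = full_chunks + (1 if remainder else 0)

print(full_chunks, remainder, n_chunks)  # 3 100000 4
```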


<font size=3>This is critical: <font color='darkcyan'><b>the producer is an iterable not a one-shot iterator. It can go through the data as many times as you need.</b></font> 
<br>
<br> 
Now if you are skeptical (like any good scientist), you are probably wondering: <i>How do I know that the produced values <b>exactly</b> match the original data source?</i> <b>Let's demonstrate that all the produced data exactly matches the original data source.</b></font>

In [8]:
# demonstrate that the produced arrays match the original data source 'data'
# concatenate all produced subarrays along the last sample axis.
produced_array = np.concatenate([subarr for subarr in pro], axis=1)

# now test if the combined produced arrays match the original data array
print('Fingers crossed.. Do they match? -> {}'.format(np.allclose(produced_array, data)))

Fingers crossed.. Do they match? -> True


<font size=3> Our method of testing array equality required us to concatenate the produced arrays. Since converting a producer to an ndarray is likely something you'll need often, every producer instance provides it as a method called <i>to_array</i>. <b>Let's call this important method and repeat our test.</b></font>

In [9]:
# demonstrate that the produced arrays match the original data source 'data'
# the producer's to_array method concatenates all produced subarrays along the sample axis.
produced_array = pro.to_array()

#now test if the combined produced arrays match the original data array
print('Match? -> {}'.format(np.allclose(produced_array, data)))

Match? -> True


<font size=3>Of course this was just one test. If you need to see more tests to be convinced, please see openseize.core.tests.producer_tests for the formal pytest tests.</font>

### Producers From Sequences

<font size=3>As you might guess from our discussion on producers built from arrays, producers can be built from any sequence. <b>Let's show that producers can be built from sequences.</b></font>

In [10]:
# make a fun sequence from Monty Python
my_seq = [["(Knight) Tis but a scratch."],
          ["(Arthur) A scratch? Your arm's off!"],
          ["(Knight) No, it isn't."],
          ["(Arthur) Well, what's that then?"]]
# convert it to a scene producer
scene = producer(my_seq, chunksize=1, axis=-1)

In [11]:
# play the scene out
for dialog in scene:
    print(dialog)

['(Knight) Tis but a scratch.']
["(Arthur) A scratch? Your arm's off!"]
["(Knight) No, it isn't."]
["(Arthur) Well, what's that then?"]


<font size=3>There is no restriction on the datatype that can be produced as the above snippet demonstrates. As such, you might find producers useful for other large tasks that you need to break into subproblems.</font>

### Producers From Files

<font size=3>Producing data from a file on disk that is too large to fit into memory is one of the most important use cases for producers. Here we are going to open a European Data Format (EDF+) binary file and produce arrays from it. A detailed demo of this important file reader can be found in the file_reading demo.</font>

#### Demo Data

<font size=3>In order to produce from a file, we will need demo data. Openseize includes a sample edf file called <i>recording_001.edf</i>. <font color='firebrick'><b>Where is this file?</b></font>

<font size=3>When we imported the demos module, we got a paths object that has two methods. The <font color='firebrick'><i><b>available</b></i></font> method lists all the datasets available in the local demos/data directory as well as the demo data available in a remote Zenodo repository. The <font color='firebrick'><i><b>locate</b></i></font> method returns a local filepath to a named dataset. If the data file is not already on your system, locate will need to download it first.</font>

In [12]:
# check out the available demo data included with openseize
# this will include any local data in demos/data and remote data stored at openseize's Zenodo repository.
demos.paths.available

---Available demo data files & location---
------------------------------------------
annotations_001.txt            '/home/matt/python...nnotations_001.txt'
recording_001.edf              '/home/matt/python.../recording_001.edf'
5872_Left_group A.txt          '/home/matt/python...2_Left_group A.txt'
split0.edf                     '/home/matt/python...os/data/split0.edf'
5872_Left_group A-D.edf        '/home/matt/python...Left_group A-D.edf'
irregular_write_test.edf       '/home/matt/python...lar_write_test.edf'
write_test.edf                 '/home/matt/python...ata/write_test.edf'
CW0259_SWDs.npy                '/home/matt/python...ta/CW0259_SWDs.npy'
subset_001.edf                 '/home/matt/python...ata/subset_001.edf'
split1.edf                     '/home/matt/python...os/data/split1.edf'


In [13]:
# if the file is not on your machine, a confirmation box will open to confirm your download
recording_path = demos.paths.locate('recording_001.edf')

In [14]:
# now confirm the recording_001.edf file is on your machine
demos.paths.available

---Available demo data files & location---
------------------------------------------
annotations_001.txt            '/home/matt/python...nnotations_001.txt'
recording_001.edf              '/home/matt/python.../recording_001.edf'
5872_Left_group A.txt          '/home/matt/python...2_Left_group A.txt'
split0.edf                     '/home/matt/python...os/data/split0.edf'
5872_Left_group A-D.edf        '/home/matt/python...Left_group A-D.edf'
irregular_write_test.edf       '/home/matt/python...lar_write_test.edf'
write_test.edf                 '/home/matt/python...ata/write_test.edf'
CW0259_SWDs.npy                '/home/matt/python...ta/CW0259_SWDs.npy'
subset_001.edf                 '/home/matt/python...ata/subset_001.edf'
split1.edf                     '/home/matt/python...os/data/split1.edf'


#### Building Data Readers

<font size=3>Ok, so we have a path to a data file. <font color='darkcyan'><b>So can we make a producer using this path?</b></font> No. Openseize is a highly extensible package that will support many different file types, and each file type must be read according to its own protocol. For reading EDF files, openseize has an EDF reader. We discuss file readers in the file_reading demo. For now, we will make the EDF reader and explain the steps as we go along.</font>

In [15]:
# how do  we build an edf reader -- ask for help!
help(edf.Reader)

Help on class Reader in module openseize.file_io.edf:

class Reader(openseize.file_io.bases.Reader)
 |  Reader(path: Union[str, pathlib.Path]) -> None
 |  
 |  A reader of European Data Format (EDF/EDF+) files.
 |  
 |  This reader supports reading EEG data and metadata from an EDF file with
 |  and without context management (see Introduction). If opened outside
 |  of context management, you should close this Reader's instance manually
 |  by calling the 'close' method to recover open file resources when you
 |  finish processing a file.
 |  
 |  Attributes:
 |      header (dict):
 |          A dictionary representation of the EDFs header.
 |      shape (tuple):
 |          A (channels, samples) shape tuple.
 |      channels (Sequence):
 |          The channels to be returned from the 'read' method call.
 |  
 |  Examples:
 |      >>> from openseize.demos import paths
 |      >>> filepath = paths.locate('recording_001.edf')
 |      >>> from openseize.io.edf import Reader
 |      >>> 

<font size=3>Under the 'Methods' section we see that to make a reader all we need is an edf path.</font>

In [16]:
# build the reader using the recording path we fetched
reader = edf.Reader(recording_path)

<font size=3>Also notice that under the methods there is just one method called <i>read</i>. It reads samples from the EDF file between the start and stop sample indices for each channel in the channels list. Let's see if this does something.</font>

In [17]:
values = reader.read(start=0, stop=10) #channels unspecified means read all channels
print(values)

[[-19.87908032   7.95793213  19.88808032  18.89390131  18.89390131
   43.74837671   6.96375311  30.8240495    5.9695741   22.87061737]
 [-86.4890744   51.70180884  63.63195703  88.48643243  63.63195703
   61.643599    54.68434589  43.74837671  55.6785249   64.62613605]
 [-85.49489539  44.74255573  29.82987048  79.53882129  52.69598785
   42.75419769  42.75419769  21.87643835  60.64941998  66.61449408]
 [ 62.63777802  95.44568555  77.55046326  36.7891236  109.36419177
  118.31180292 122.28851898 115.32926587  76.55628424  44.74255573]]


<font size=3>So we got an array of shape (4,10) back. <font color='darkcyan'><b><i>Is that right?</i></b></font></font>

In [18]:
# examine the reader with print
print(reader)

Reader Object
---Attributes & Properties---
{'path': PosixPath('/home/matt/python/nri/openseize/src/openseize/demos/data/recording_001.edf'),
 'mode': 'rb',
 'kwargs': {},
 'header': {'version': '0',
            'patient': 'PIN-42 M 11-MAR-1952 Animal',
            'recording': 'Startdate 15-AUG-2020 X X X',
            'start_date': '15.08.20',
            'start_time': '09.59.15',
            'header_bytes': 1536,
            'reserved_0': 'EDF+C',
            'num_records': 3775,
            'record_duration': 1.0,
            'num_signals': 5,
            'names': ['EEG EEG_1_SA-B', 'EEG EEG_2_SA-B', 'EEG EEG_3_SA-B',
                      'EEG EEG_4_SA-B', 'EDF Annotations'],
            'transducers': ['8401 HS:15279', '8401 HS:15279', '8401 HS:15279',
                            '8401 HS:15279', ''],
            'physical_dim': ['uV', 'uV', 'uV', 'uV', ''],
            'physical_min': [-8144.31, -8144.31, -8144.31, -8144.31, -1.0],
            'physical_max': [8144.319, 8144.319

<font size=3>Awesome! Printing the reader gives us the EDF's header, which tells us everything we need to know. The header lists 5 signals: 4 EEG channels named [EEG EEG_1_SA-B,... EEG EEG_4_SA-B] and an 'EDF Annotations' signal. The reader returned the 4 EEG channels, so our array shape makes sense.</font>

<font size=3><font color='darkcyan'><b>Can we finally build a producer? YES</b></font>, producers can produce from any reader type as long as the reader has a method called <font color='firebrick'><b>read</b></font>. Lucky for us the EDF reader we just built has a read method.
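
<font size=3>To illustrate the idea of producing from anything that has a read method, here is a minimal sketch. ArrayReader and read_in_chunks are hypothetical names invented for this example; the sketch only shows the general pattern of calling read(start, stop) repeatedly and makes no claim about what openseize requires of a reader internally.</font>

```python
import numpy as np


class ArrayReader:
    """Hypothetical reader exposing a shape and a read(start, stop) method."""

    def __init__(self, data):
        self._data = data
        self.shape = data.shape

    def read(self, start, stop=None):
        # return samples between start and stop for every channel
        return self._data[:, start:stop]


def read_in_chunks(reader, chunksize):
    """Yield consecutive arrays by calling the reader's read method."""
    n_samples = reader.shape[-1]
    for start in range(0, n_samples, chunksize):
        stop = min(start + chunksize, n_samples)
        yield reader.read(start, stop)


toy_reader = ArrayReader(np.arange(24, dtype=float).reshape(4, 6))
toy_shapes = [chunk.shape for chunk in read_in_chunks(toy_reader, chunksize=4)]
print(toy_shapes)  # [(4, 4), (4, 2)]
```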

In [19]:
# build a producer from our edf reader instance
# the chunksize will be set to 100k samples
# the axis should be 1 since the reader has samples along this axis
rpro = producer(reader, chunksize=100000, axis=1)

In [20]:
# lets check out the data and attributes of this producer
print(rpro)

ReaderProducer Object
---Attributes & Properties---
{'data': Reader(path: Union[str, pathlib.Path]) -> None,
 'axis': 1,
 'kwargs': {},
 'chunksize': 100000,
 'shape': (4, 18875000)}

Type help(ReaderProducer) for full documentation


In [21]:
# and let's check the reader producer's memory consumption (shallow size of the producer object)
print('pro is using = {} Bytes'.format(sys.getsizeof(rpro)))

pro is using = 48 Bytes


In [22]:
# how much memory would we use if we loaded all the data in at once?
size = np.prod(rpro.shape)
itemsize = 8 # 8 bytes in a float64
print('data would be = {} MB'.format((size * itemsize)/1e6))

data would be = 604.0 MB


In [23]:
# lets also get the shape of each produced array
shapes = [arr.shape for arr in rpro]
print(shapes)

[(4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000), (4, 100000)

<font size=3>So this ReaderProducer is giving us access to a little over 600 MB of data using only 48 bytes!
<b>Let's test to make sure the produced values equal the values from the file read in as a single array.</b></font>

In [24]:
# This will consume 604 MB of memory
produced_array = rpro.to_array()

# test if this matches the array from reading all the data
read_array = reader.read(0, stop=None) #if stop is None the reader reads to the end of file

print('Do they match...', np.allclose(produced_array, read_array))

Do they match... True


### Masked Producers

#### Masking with Annotations

<font size=3>While the producer will produce sequential chunks of data from the start of file to the end of a file, it is common that researchers only want to analyze specific sections of the data. To support non-contiguous production of arrays from a data source, producers can be initialized with a mask. <b>Let's see how to use an annotation file to create a mask so that the producer only produces data we want.</b></font>

In [25]:
# again lets check the available demo data
demos.paths.available

---Available demo data files & location---
------------------------------------------
annotations_001.txt            '/home/matt/python...nnotations_001.txt'
recording_001.edf              '/home/matt/python.../recording_001.edf'
5872_Left_group A.txt          '/home/matt/python...2_Left_group A.txt'
split0.edf                     '/home/matt/python...os/data/split0.edf'
5872_Left_group A-D.edf        '/home/matt/python...Left_group A-D.edf'
irregular_write_test.edf       '/home/matt/python...lar_write_test.edf'
write_test.edf                 '/home/matt/python...ata/write_test.edf'
CW0259_SWDs.npy                '/home/matt/python...ta/CW0259_SWDs.npy'
subset_001.edf                 '/home/matt/python...ata/subset_001.edf'
split1.edf                     '/home/matt/python...os/data/split1.edf'


In [26]:
# let's fetch annotations_001.txt. This contains user annotations for recording_001.edf
filepath = demos.paths.locate('annotations_001.txt')

In [27]:
# lets have a look at this file's contents
with open(filepath, 'r') as infile:
    for idx, line in enumerate(infile):
        print(idx, line)

0 Experiment ID	Experiment

1 Animal ID	Animal

2 Researcher	Test

3 Directory path	

4 

5 

6 Number	Start Time	End Time	Time From Start	Channel	Annotation

7 0	08/15/20 09:59:15.215	08/15/20 09:59:15.215	0.0000	ALL	Started Recording

8 1	08/15/20 10:00:00.000	08/15/20 10:00:00.000	44.7850	ALL	Qi_start

9 2	08/15/20 10:00:25.000	08/15/20 10:00:30.000	69.7850	ALL	grooming

10 3	08/15/20 10:00:45.000	08/15/20 10:00:50.000	89.7850	ALL	grooming

11 4	08/15/20 10:02:15.000	08/15/20 10:02:20.000	179.7850	ALL	grooming

12 5	08/15/20 10:04:36.000	08/15/20 10:04:41.000	320.7850	ALL	exploring

13 6	08/15/20 10:05:50.000	08/15/20 10:05:55.000	394.7850	ALL	exploring

14 7	08/15/20 10:08:50.000	08/15/20 10:08:55.000	574.7850	ALL	rest

15 8	08/15/20 10:10:14.000	08/15/20 10:10:19.000	658.7850	ALL	exploring

16 9	08/15/20 10:17:10.000	08/15/20 10:17:15.000	1074.7850	ALL	rest

17 10	08/15/20 10:35:49.000	08/15/20 10:35:54.000	2193.7850	ALL	rest

18 11	08/15/20 10:40:00.000	08/15/20 10:40:00.000	2444

<font  size=3>This file contains 13 annotations; these annotations include user start and stop, rest, grooming and exploring. Our goal is to use these annotations to build a producer that produces values for only rest and exploring times.</font>

<font  size=3>Openseize has a set of annotation file readers. This file is in the Pinnacle format so we will use the Pinnacle annotations reader (this is described further in the file_reading demo).
</font>

In [28]:
# for details on reading Pinnacle files see the file_reading demo

# Pinnacle files need a path & the column header line (start)
with annotations.Pinnacle(path=filepath, start=6) as areader:
    # read creates a list of annotation dataclass instances that contain the label, 
    # time (in secs), duration (in secs) and the channel(s) the annotation was marked 
    # for.
    annotes = areader.read(labels=['rest','exploring'])

# let's print each annotation instance in the annotes list
for an in annotes:
    print(an)
    
#you can see the times match the time from the start of recording in the text file.

Annotation(label='exploring', time=320.785, duration=5.0, channel='ALL')
Annotation(label='exploring', time=394.785, duration=5.0, channel='ALL')
Annotation(label='rest', time=574.785, duration=5.0, channel='ALL')
Annotation(label='exploring', time=658.785, duration=5.0, channel='ALL')
Annotation(label='rest', time=1074.785, duration=5.0, channel='ALL')
Annotation(label='rest', time=2193.785, duration=5.0, channel='ALL')


<font size=3 color='darkcyan'><b>So how do we take these Annotation dataclass instances and turn them into a mask for a producer?</b></font>

The annotations module in openseize has a function called <font color='firebrick'><i>as_mask</i></font>.

In [29]:
help(annotations.as_mask)

Help on function as_mask in module openseize.file_io.annotations:

as_mask(annotations: Sequence[openseize.file_io.bases.Annotation], size: int, fs: float, include: bool = True) -> numpy.ndarray[typing.Any, numpy.dtype[numpy.bool_]]
    Creates a boolean mask from a sequence of annotation dataclass
    instances..
    
    Producers of EEG data may receive an optional boolean array mask.  This
    function creates a boolean mask from a sequence of annotations and is
    therefore useful for filtering EEG data by annotation label during
    processing.
    
    Args:
        annotations:
            A sequence of annotation dataclass instances to convert to a 
            mask.
        size:
            The length of the boolean array to return.
        fs:
            The sampling rate in Hz of the digital system.
        include:
            Boolean determining if annotations should be set to True or
            False in the returned array. True means all values
            are False 

<font size=3>To make a mask we need the annotation dataclasses, the length of the mask, and the sampling rate. The sampling rate is needed to convert the time of the annotation into a sample number.</font>
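
<font size=3>To make the time-to-sample conversion concrete, here is a minimal sketch of how such a mask could be built by hand. The Note dataclass and simple_mask function are hypothetical and written only for this illustration; they are not openseize's as_mask implementation, which may differ in rounding and edge handling. The times and durations are the rest and exploring annotations printed earlier.</font>

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Note:
    """Hypothetical annotation record: start time and duration in seconds."""
    label: str
    time: float
    duration: float


def simple_mask(notes, size, fs):
    """Return a boolean array that is True inside each annotated period."""
    mask = np.zeros(size, dtype=bool)
    for note in notes:
        # convert seconds to sample indices using the sampling rate
        start = int(round(note.time * fs))
        stop = start + int(round(note.duration * fs))
        mask[start:stop] = True
    return mask


notes = [
    Note('exploring', 320.785, 5.0),
    Note('exploring', 394.785, 5.0),
    Note('rest', 574.785, 5.0),
    Note('exploring', 658.785, 5.0),
    Note('rest', 1074.785, 5.0),
    Note('rest', 2193.785, 5.0),
]

toy_mask = simple_mask(notes, size=18875000, fs=5000)
kept = int(toy_mask.sum())
print(kept)  # 150000 samples: 6 periods x 5 secs x 5000 Hz
```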

In [30]:
# make a mask the same size as the number of samples in the reader & set values at the annotes to True
mask = annotations.as_mask(annotes, size=reader.shape[-1], fs=5000, include=True)
print(mask.shape)

(18875000,)


<font size=3>Note that the size of the mask is the same size as all the samples in the reader. So now we are ready to make a Masked Producer.</font>

In [31]:
masked_pro = producer(reader, chunksize=100000, axis=-1, mask=mask)

In [32]:
print(masked_pro)

MaskedProducer Object
---Attributes & Properties---
{'data': ReaderProducer(data, chunksize, axis, **kwargs),
 'axis': -1,
 'kwargs': {},
 'mask': ArrayProducer(data, chunksize, axis, **kwargs),
 'chunksize': 100000,
 'shape': (4, 150000)}

Type help(MaskedProducer) for full documentation


<font size=3>Notice the masked producer now has 150K samples. This is because we are keeping only the rest and exploring periods of the EEG: 6 annotations of 5 secs each is 30 secs of data, and 30 secs @ 5 kHz = 150K samples.</font>

In [33]:
# as before, we can loop over the producer with a mask 
# this will give us the shape of the produced arrays ONLY during periods of rest and exploring
shapes = [arr.shape for arr in masked_pro]
print(shapes)

[(4, 100000), (4, 50000)]


<font size=3>In the above example we built a mask from a list of annotation dataclass instances, but a masked producer can be built using any numpy boolean array. This allows for sophisticated masking conditions. For example, to keep only certain hours of the EEG and filter out periods like grooming, you can construct 2 masks <i>and</i> take their intersection as the mask to apply to the producer.</font>
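
<font size=3>As a minimal sketch of combining masks, the example below builds two boolean arrays with plain numpy and intersects them. The one-minute length, the time window, and the grooming period are made-up values chosen only for illustration.</font>

```python
import numpy as np

fs = 5000            # sampling rate in Hz
size = 60 * fs       # a made-up one-minute recording

# mask 1: keep only samples between seconds 10 and 40
in_window = np.zeros(size, dtype=bool)
in_window[10 * fs:40 * fs] = True

# mask 2: mark a made-up grooming period between seconds 20 and 25
grooming = np.zeros(size, dtype=bool)
grooming[20 * fs:25 * fs] = True

# intersection: inside the window AND not grooming
combined = in_window & ~grooming

kept_seconds = combined.sum() / fs
print(kept_seconds)  # 25.0
```

<font size=3>The resulting array has one boolean per sample, so it has the form a producer's mask argument expects, provided its length matches the number of samples in the data source.</font>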

### Producers from Generating functions

<font size=3>A major question that the developers of Openseize needed to address is: <font color='darkcyan'><b>What good is a producer if you have to convert it into an array in order to compute something?</b></font></font>

<font size=3>To address this, Openseize relies on converting generating functions into producers. This allows computations to be performed on-the-fly inside the body of a generating function, which is then converted into a multitransversal producer: an iterable that can be traversed as many times as needed. This is a lot to take in, so we are going to go through a few simple examples.</font>
  
<font size=3>To take advantage of openseize's high memory efficiency, you'll need to familiarize yourself with generating functions: <a href=https://realpython.com/introduction-to-python-generators/>generating functions tutorial</a></font>

In [34]:
# lets start by re-examining our producer built from the values in the edf file
print(rpro)

ReaderProducer Object
---Attributes & Properties---
{'data': Reader(path: Union[str, pathlib.Path]) -> None,
 'axis': 1,
 'kwargs': {},
 'chunksize': 100000,
 'shape': (4, 18875000)}

Type help(ReaderProducer) for full documentation


<font size=3> Now let's say that we want to take every value in this producer and square it. To do this we will need to loop over the producer, but we don't want to store all the squared values in an array since that would take a lot of memory. The solution is to build a generating function.</font>


In [35]:
def squared(pro):
    """A generating function that yields the squared values of each subarray in a producer."""
    
    for arr in pro:
        yield arr**2

<font size=3>Now let's make our generator and run it. We will compute the largest squared value in the data.</font>

In [36]:
# compute the max squared value across all chunks
gen = squared(rpro) #make a generator

extreme = 0
cnt = 0
for arr in gen:
    
    # max of this sub arr
    m = np.max(arr)
    cnt += 1
    
    if m > extreme:
        extreme = m

print('The largest squared value of the {} arrays is {}'.format(cnt, extreme))

The largest squared value of the 189 arrays is 1061277445.6032624


<font size=3>The problem with this approach is that the generator 'gen' has now been exhausted. So anytime we need the square values we have to make a new generator from the generating function. To see this try to get the generator's next value...</font>

In [37]:
try:
    next(gen)

except StopIteration:
    print("Can't do it-- this gen is empty")

Can't do it-- this gen is empty


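<font size=3>Another problem is that a generator can't be copied directly. The closest tool, itertools.tee (called teeing), splits one generator into several iterators, but they all draw from the same underlying generator: if you advance the original, every teed iterator skips those values, and tee has to buffer each value until all of the teed iterators have consumed it. This is a serious problem because we need independent copies that can be advanced separately to solve problems where the computed value at a sample depends on the surrounding sample values.</font>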
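
<font size=3>The limits of copying generators can be checked with the standard library alone. The sketch below uses a toy generating function, count_up, written only for this illustration. It shows that copy.copy refuses to copy a generator, that itertools.tee produces iterators by buffering values, and that advancing the original generator after teeing changes what the teed iterators see.</font>

```python
import copy
import itertools


def count_up(n):
    """Toy generating function yielding 0, 1, ..., n-1."""
    for i in range(n):
        yield i


# 1. generators cannot be copied with the copy module
gen = count_up(5)
try:
    copy.copy(gen)
    copy_failed = False
except TypeError:
    copy_failed = True

# 2. tee gives two iterators; values consumed by one are buffered for the other
a, b = itertools.tee(count_up(5))
first_a = [next(a), next(a), next(a)]
all_b = list(b)

# 3. advancing the original generator after teeing affects every teed iterator
shared = count_up(5)
t1, t2 = itertools.tee(shared)
next(shared)            # consumes 0 from the shared source
seen_t1 = list(t1)      # 0 is missing

print(copy_failed, first_a, all_b, seen_t1)
```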

<font size=3><font color='darkcyan'><b>The way out of these problems is to make producers from generators. Let's go through this.</b></font></font>

In [38]:
# let's make a producer of squared values using the squared generating function

# notice we pass in the generating function, a chunksize, an axis and the combined shape of all the
# arrays that will be yielded by the generating function. Since it just squares values, the shape is
# the same as the rpro shape. Lastly, squared is a function and it needs a producer, so we pass that
# as a keyword argument

squared_pro = producer(squared, chunksize=120000, axis=-1, shape=rpro.shape, pro=rpro)

# lastly, notice that we gave a chunksize of 120k that differs from rpro's chunksize of 100k. That's ok, 
# producers are smart enough to figure out how to collect values from the generating function until the 
# new chunksize is reached.

In [39]:
# lets again compute the maximum squared value

extreme = 0
cnt = 0
for arr in squared_pro:
    
    # max of this sub arr
    m = np.max(arr)
    cnt += 1
    
    if m > extreme:
        extreme = m
        
# the number of arrays will be less b/c we increased chunksize to 120k
print('The largest squared value of the {} arrays is {}'.format(cnt, extreme))

The largest squared value of the 158 arrays is 1061277445.6032624


In [40]:
# since squared_pro is a producer, it is multitransversal
cnt = 0
for sub_arr in squared_pro:
    cnt += 1

print('squared_pro has {} subarrays.'.format(cnt))

squared_pro has 158 subarrays.


In [41]:
# making a copy of the producer is easy because openseize supports producer creation from producers.
squared_pro2 = producer(squared_pro, chunksize=10000, axis=-1) #These move independently from each other.

<font size=3>To recap, we can convert generating functions into producers. These producers are both multitransversal and support copying, features that are absent from generators. This makes them far more usable in numerical code.
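
<font size=3>For intuition, here is a minimal sketch of how a re-iterable object could wrap a generating function and collect its output into a new chunksize. The Rechunked class is hypothetical and written only for this illustration; it is not openseize's implementation, which may work quite differently. The key idea it shows is that each traversal calls the generating function again to get a fresh generator.</font>

```python
import numpy as np


class Rechunked:
    """Hypothetical re-iterable wrapper around a generating function."""

    def __init__(self, genfunc, chunksize, axis=-1, **kwargs):
        self.genfunc = genfunc
        self.chunksize = chunksize
        self.axis = axis
        self.kwargs = kwargs

    def __iter__(self):
        # Calling the generating function builds a brand-new generator,
        # so every traversal starts again from the first array.
        pending = []   # arrays collected but not yet yielded
        size = 0       # samples currently held in pending along axis
        for arr in self.genfunc(**self.kwargs):
            pending.append(arr)
            size += arr.shape[self.axis]
            while size >= self.chunksize:
                joined = np.concatenate(pending, axis=self.axis)
                chunk, rest = np.split(joined, [self.chunksize], axis=self.axis)
                yield chunk
                pending = [rest]
                size = rest.shape[self.axis]
        if size > 0:
            # the final chunk may be shorter than chunksize
            yield np.concatenate(pending, axis=self.axis)


def squared(chunks):
    """Generating function yielding the square of each incoming array."""
    for arr in chunks:
        yield arr ** 2


data = np.arange(20).reshape(2, 10)
source = [data[:, :4], data[:, 4:8], data[:, 8:]]   # chunks of 4, 4 and 2

rechunked = Rechunked(squared, chunksize=3, axis=-1, chunks=source)
first_pass = [arr.shape for arr in rechunked]
second_pass = [arr.shape for arr in rechunked]
print(first_pass)  # [(2, 3), (2, 3), (2, 3), (2, 1)]
```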

<font size=3>To simplify creating producers from generating functions, Openseize provides a decorator called as_producer. This decorator turns generating functions into producers without the need to explicitly call the producer creator "producer(...)". <b>Let's see this in action.</b></font>

In [42]:
#again we will make a squaring producer but this time using the as_producer decorator

@as_producer
def square_deco(pro):
    for arr in pro:
        yield arr**2
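
<font size=3>To see why a decorated generating function can be iterated more than once, here is a minimal sketch of the idea. It is an illustration under assumptions and not how as_producer is implemented: <code>Reiterable</code> and <code>as_reiterable</code> are hypothetical names, and the only trick shown is calling the generating function afresh each time iteration starts.</font>

```python
import functools
import numpy as np

class Reiterable:
    """Hypothetical wrapper that restarts a generating function on each iteration."""

    def __init__(self, genfunc, *args, **kwargs):
        self.genfunc = genfunc
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # a brand new generator is built every time a loop begins
        return self.genfunc(*self.args, **self.kwargs)

def as_reiterable(genfunc):
    """Hypothetical decorator: calling the function returns a Reiterable."""
    @functools.wraps(genfunc)
    def wrapper(*args, **kwargs):
        return Reiterable(genfunc, *args, **kwargs)
    return wrapper

@as_reiterable
def square(arrays):
    for arr in arrays:
        yield arr ** 2

chunks = [np.array([1.0, -3.0]), np.array([2.0, 5.0])]
squared = square(chunks)

first_pass = [float(arr.max()) for arr in squared]
second_pass = [float(arr.max()) for arr in squared]  # a plain generator would be empty here
print(first_pass, second_pass)
```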

In [43]:
# let's again compute the maximum squared value

extreme = 0
cnt = 0
for arr in square_deco(rpro):
    
    # max of this sub arr
    m = np.max(arr)
    cnt += 1
    
    if m > extreme:
        extreme = m

print('The largest squared value of the {} subarrays is {}'.format(cnt, extreme))

The largest squared value of the 189 subarrays is 1061277445.6032624


In [44]:
# since square_deco(rpro) returns a producer, it is multitransversal
cnt = 0
for sub_arr in square_deco(rpro):
    cnt += 1

print('square_deco(rpro) has {} subarrays.'.format(cnt))

# Did we get the right number of subarrays?
samples = rpro.shape[-1]
csize = rpro.chunksize
print('We expected {} subarrays'.format(samples // csize + bool(samples % csize)))

square_deco(rpro) has 189 subarrays.
We expected 189 subarrays
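
<font size=3>The expected count is a ceiling division of the number of samples by the chunksize. As a quick check, the sketch below applies it to each chunksize used in this demo, taking the 18875000 samples reported by rpro.shape. The variable names are illustrative only.</font>

```python
samples = 18875000  # length of the sample axis, from rpro.shape

def n_arrays(samples, chunksize):
    """Number of arrays a producer yields: ceiling of samples / chunksize."""
    return -(-samples // chunksize)  # integer ceiling division

counts = {csize: n_arrays(samples, csize) for csize in (100000, 120000, 150000, 10000)}
print(counts)
# {100000: 189, 120000: 158, 150000: 126, 10000: 1888}
```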


<font size=3>This section took us into the deep topic of converting generating functions into producers. Not all users will need this. Openseize provides many tools that you can call like regular functions. Just know that under the hood, these functions use this conversion. If you need to compute quantities that are not part of openseize and your data is too large for memory, then creating producers from generating functions is the way to go.</font>

### Producers From Producers

<font size=3> In the last section, we noted that you can copy producers and then use them independently. The construction of producer copies is straightforward.</font>

In [45]:
# print our EDF data producer
print('Original Producer\n',rpro)
print('\n')
# make a new producer changing the chunksize
newpro = producer(rpro, chunksize=150000, axis=-1)
print('New Producer\n', newpro)

Original Producer
 ReaderProducer Object
---Attributes & Properties---
{'data': Reader(path: Union[str, pathlib.Path]) -> None,
 'axis': 1,
 'kwargs': {},
 'chunksize': 100000,
 'shape': (4, 18875000)}

Type help(ReaderProducer) for full documentation


New Producer
 ReaderProducer Object
---Attributes & Properties---
{'data': Reader(path: Union[str, pathlib.Path]) -> None,
 'axis': -1,
 'kwargs': {},
 'chunksize': 150000,
 'shape': (4, 18875000)}

Type help(ReaderProducer) for full documentation


<font size=3>These producers can be iterated over independently.</font>

## Resource Recovery

<font size=3>Throughout the second half of this tutorial, we have used a reader instance. This instance holds resources that provide access to the opened EDF file. It is important to close these instances when you are done to recover those resources. In the file-reading demo, we will show how to create a reader using a <font color='firebrick'>Context Manager</font>, which will automatically close the file when you are done with it.</font>

In [46]:
# close the open reader instance
reader.close()
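
<font size=3>For readers unfamiliar with context managers, the sketch below shows the general Python pattern using a hypothetical stand-in class, <code>DemoReader</code>. It is not the openseize Reader, whose usage is covered in the file-reading demo. The point is only that leaving the <code>with</code> block calls <code>close()</code> for you.</font>

```python
import tempfile
from pathlib import Path

class DemoReader:
    """Hypothetical stand-in for a file reader; not the openseize Reader."""

    def __init__(self, path):
        self._fobj = open(path, 'rb')

    def read(self, start, stop):
        """Return the bytes between start and stop."""
        self._fobj.seek(start)
        return self._fobj.read(stop - start)

    def close(self):
        self._fobj.close()

    @property
    def closed(self):
        return self._fobj.closed

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # runs when the with block exits, even if an error was raised
        self.close()

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / 'demo.bin'
    path.write_bytes(bytes(range(10)))

    with DemoReader(path) as demo_reader:
        first_bytes = demo_reader.read(0, 4)

    was_closed = demo_reader.closed  # the file was closed on exit from the with block

print(first_bytes, was_closed)
```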

## Summary

<font size=3><b>Producers are at the heart of openseize.</b> All available methods in the modules of openseize can accept producers of arrays. This means openseize can compute quantities even when the input or output may not fit into virtual memory. Your data may not require this level of memory efficiency. If not, you can also supply ndarrays to openseize methods. Be aware that some methods may run faster using producer inputs and some may run faster using full arrays. The goal here is to give users options for how they want to use their machine's memory. Lastly, chunksizes can also affect the speed of computations. There is a Goldilocks zone that will depend on the available memory of your machine. Openseize gives you the flexibility to adjust this parameter so you decide how much memory you want a process to take.</font>
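
<font size=3>As a rough guide to choosing a chunksize, the sketch below estimates the memory held by one produced array versus the full dataset for the 4 x 18875000 data used in this demo. It assumes 8-byte (float64) samples, which is an assumption rather than something stated above, and it ignores any temporary arrays a computation may create.</font>

```python
n_channels, n_samples = 4, 18875000  # rpro.shape in this demo
bytes_per_value = 8                  # assumption: float64 samples

def megabytes(chunksize):
    """Approximate size in MB of one (n_channels x chunksize) array."""
    return n_channels * chunksize * bytes_per_value / 1e6

full_mb = megabytes(n_samples)
per_chunk_mb = {c: megabytes(c) for c in (10000, 100000, 150000)}

print('full array: {:.1f} MB'.format(full_mb))
for csize, mb in per_chunk_mb.items():
    print('chunksize {:>6}: {:.2f} MB per array'.format(csize, mb))
```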