# The Recording Container: Cedalion's main data structure and a guide to indexing  

This example notebook introduces the main data classes used by Cedalion and provides examples of how to access and index them.

## Overview

**The class `cedalion.dataclasses.Recording` is Cedalion's main data container that can be used to carry related data objects through the program.** It can store time series, masks, auxiliary timeseries, probe, headmodel and stimulus information as well as meta data about the recording.
It has the following properties:

- It resembles the [NIRS group in the snirf specification](https://github.com/fNIRS/snirf/blob/v1.1/snirf_specification.md#nirsi), which provides storage for much of the data stored in a `Recording` (e.g. time series map to [data elements](https://github.com/fNIRS/snirf/blob/v1.1/snirf_specification.md#nirsidataj), [probe](https://github.com/fNIRS/snirf/blob/v1.1/snirf_specification.md#nirsiprobe), [stimulus](https://github.com/fNIRS/snirf/blob/v1.1/snirf_specification.md#nirsistimj) and [meta data](https://github.com/fNIRS/snirf/blob/v1.1/snirf_specification.md#nirsimetadatatags) are stored per NIRS element, etc). Consequently, `cedalion.io.read_snirf` and `cedalion.io.write_snirf` operate on lists of recordings.
- different time series and masks are stored in ordered dictionaries
  - the user differentiates time series by name
  - there is a set of canonical names used by `read_snirf` to assign names to time series
    ```
    CANONICAL_NAMES = {
          "unprocessed raw": "amp",
          "processed raw": "amp",
          "processed dOD": "od",
          "processed concentrations": "conc",
          "processed central moments": "moments",
          "processed blood flow index": "bfi",
          "processed HRF dOD": "hrf_od",
          "processed HRF central moments": "hrf_moments",
          "processed HRF concentrations": "hrf_conc",
          "processed HRF blood flow index": "hrf_bfi",
          "processed absorption coefficient": "mua",
          "processed scattering coefficient": "musp",
    }
    ```
- time series are stored in the dictionaries in the order that they were added
- the container offers convenient access to the last changed time series; together with the canonical names, this allows consecutive transformations of time series without specifying each one by name, which simplifies building workflows
- `rec[key]` is a shortcut for `rec.timeseries[key]` 
- not all information stored in a `Recording` can be stored in snirf files, e.g. the snirf specification has no provision for masks, the headmodel and auxiliary objects. We will probably use sidecar files or sidecar hdf groups to store these.
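
The bullets on ordered storage and the `rec[key]` shortcut can be illustrated with a toy container. This is a minimal sketch and not Cedalion's implementation: `MiniRecording` and `last_timeseries_name` are hypothetical names, plain lists stand in for time series, and "last" here means "most recently added".

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class MiniRecording:
    """Toy container (hypothetical); not Cedalion's Recording class."""

    timeseries: OrderedDict = field(default_factory=OrderedDict)
    masks: OrderedDict = field(default_factory=OrderedDict)

    def __getitem__(self, key):
        # rec[key] is a shortcut for rec.timeseries[key]
        return self.timeseries[key]

    def __setitem__(self, key, value):
        self.timeseries[key] = value

    def last_timeseries_name(self):
        # entries keep insertion order, so the newest one is the last key
        return next(reversed(self.timeseries))


toy_rec = MiniRecording()
toy_rec["amp"] = [1.0, 2.0, 3.0]    # stand-in for raw amplitudes
toy_rec["od"] = [0.0, -0.7, -1.1]   # stand-in for optical density
toy_rec["conc"] = [0.0, 0.1, 0.2]   # stand-in for concentrations

names = list(toy_rec.timeseries)
latest = toy_rec.last_timeseries_name()
print(names, latest)
```

Because the dictionary remembers insertion order, a processing step can pick up the newest time series without being told its name.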

![Recording Container](/img/recording/rec_container_overview.png)

## Exploring the recording container fields with some example data

In [None]:
import os
import cedalion
import cedalion.io
import cedalion.datasets
import cedalion.xrutils as xrutils
import xarray as xr

# Loading an example dataset will create a recording container.
# Alternatively you can load your own snirf file using cedalion.io.snirf.read_snirf(PATH_TO_FILE)
rec = cedalion.datasets.get_fingertapping()
display(rec)

### The timeseries field
We loaded raw amplitude data and can now access it:

In [None]:
display(rec["amp"])

Since we are interested not only in raw "amp"litude data, we convert this data to concentration using the modified Beer-Lambert law and save it under **"conc"** in the recording container.

In [None]:
import cedalion.nirs

# define DPFs and convert to HbO/HbR using the modified Beer-Lambert law
dpf = xr.DataArray(
        [6, 6],
        dims="wavelength",
        coords={"wavelength": rec["amp"].wavelength},
    )
rec["conc"] = cedalion.nirs.beer_lambert(rec["amp"], rec.geo3d, dpf)

display(rec["conc"])
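
For intuition, the modified Beer-Lambert law models the change in optical density at each wavelength as the extinction-weighted sum of the chromophore concentration changes, scaled by the source-detector distance and the DPF. The NumPy sketch below runs this forward model and inverts it again. It is not what `cedalion.nirs.beer_lambert` does internally (which also handles units and starts from amplitudes). The extinction coefficients, distance and concentration values are made-up placeholders.

```python
import numpy as np

# Placeholder extinction coefficients, rows = 2 wavelengths, columns = (HbO, HbR).
# Illustrative numbers only, NOT tabulated physiological values.
E = np.array([[0.6, 1.5],
              [1.1, 0.8]])
dpf = np.array([6.0, 6.0])  # differential pathlength factor per wavelength
distance = 3.0              # source-detector separation (arbitrary unit)

# made-up concentration changes, shape (chromophore, time)
conc_true = np.array([[0.0, 0.2, 0.4],      # HbO
                      [0.0, -0.05, -0.1]])  # HbR

# forward model: dOD(wl, t) = sum_c E[wl, c] * dC[c, t] * distance * DPF[wl]
od = (E @ conc_true) * distance * dpf[:, None]

# inversion: divide out the effective path length, then solve the 2x2 system
conc_est = np.linalg.solve(E, od / (distance * dpf[:, None]))
print(np.allclose(conc_est, conc_true))
```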

### The geo3d field
We have already used channel distances from this field to calculate the concentrations using the Beer-Lambert law above.
The geo3d and geo2d fields are DataArrays of geometric points, whose "magnitude" is the point's coordinate in 3D / 2D space. Each point also carries two coordinates: a "label", such as "S1" for Source 1, and a "type", such as PointType.Source.

In [None]:
display(rec.geo3d)

### The stim field
This field contains labels for any experimental stimuli that were logged during the recording. In this dataset, each condition lasted 5 seconds.

In [None]:
display(rec.stim)

We can see that the trial_type was encoded numerically, which can be hard to read. If we know the experiment, we can rename the stimuli using the "rename_events" function:

In [None]:
rec.stim.cd.rename_events(
        {"1.0": "control", "2.0": "Tapping/Left", "3.0": "Tapping/Right"}
    )

display(rec.stim)
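
The effect of the renaming can be reproduced on a plain pandas table. This is a sketch under assumptions: the column names (`onset`, `duration`, `value`, `trial_type`) and the onset values are illustrative, and `Series.replace` is used here instead of Cedalion's `rename_events` accessor.

```python
import pandas as pd

# toy stimulus table with made-up onsets; trial types are numeric strings
stim = pd.DataFrame(
    {
        "onset": [10.0, 40.0, 70.0],
        "duration": [5.0, 5.0, 5.0],
        "value": [1.0, 1.0, 1.0],
        "trial_type": ["1.0", "2.0", "3.0"],
    }
)

# map the numeric codes to readable condition names
mapping = {"1.0": "control", "2.0": "Tapping/Left", "3.0": "Tapping/Right"}
stim["trial_type"] = stim["trial_type"].replace(mapping)
print(stim)
```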

### The masks field
Lastly, we create a **mask** based on an SNR threshold. A mask is a Boolean DataArray that flags each point across all coordinates as either "True" or "False", according to the metric applied. Here we use an SNR threshold of 3 and flag all channels in the raw "amp" timeseries as "False" if their SNR is below the threshold. Since SNR is calculated across the whole recording, the time dimension gets dropped. Applying this mask later on to a DataArray time series works implicitly thanks to the unambiguous xarray coordinates in the mask and time series (here, for instance, the channel name).

In [None]:
import cedalion.sigproc.quality as quality
# SNR thresholding using the "snr" function of the quality subpackage using an SNR of 3
_, rec.masks["snr_mask"] = quality.snr(rec["amp"], 3)

display(rec.masks["snr_mask"])
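
The idea of a mask that drops the time dimension and is later applied via matching coordinates can be sketched with plain xarray. The SNR definition used here (temporal mean divided by temporal standard deviation) is an assumption, and `quality.snr` may differ in detail. The two channels are synthetic.

```python
import numpy as np
import xarray as xr

# synthetic amplitudes: one steady channel, one very noisy channel
amp = xr.DataArray(
    np.array(
        [
            [10.0, 10.1, 9.9, 10.0, 10.1, 9.9],
            [1.0, 5.0, 0.5, 6.0, 0.2, 4.0],
        ]
    ),
    dims=("channel", "time"),
    coords={"channel": ["S1D1", "S1D2"], "time": np.arange(6) * 0.5},
)

# reducing over time removes the time dimension from the result
snr = amp.mean("time") / amp.std("time")
snr_mask = snr > 3  # True = channel passes the threshold

# the mask aligns with the time series through the shared "channel" coordinate
masked = amp.where(snr_mask)        # failing channels become NaN
kept = amp.sel(channel=snr_mask)    # failing channels are removed
print(snr_mask.values, kept.channel.values)
```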

### The headmodel / aux_obj field
The recording container does not yet contain a head model. We load an ICBM152 atlas and create the **headmodel**:

In [None]:
import cedalion.imagereco.forward_model as fw

# load segmentation data from the icbm152 atlas
SEG_DATADIR_ic152, mask_files_ic152, landmarks_file_ic152 = cedalion.datasets.get_icbm152_segmentation()

# create forward model class for icbm152 atlas
rec.head_icbm152 = fw.TwoSurfaceHeadModel.from_surfaces(
    segmentation_dir=SEG_DATADIR_ic152,
    mask_files = mask_files_ic152,
    brain_surface_file= os.path.join(SEG_DATADIR_ic152, "mask_brain.obj"),
    landmarks_ras_file=landmarks_file_ic152,
    brain_face_count=None,
    scalp_face_count=None
)

display(rec.head_icbm152)

## xarray DataArray Indexing and Selecting Data
xarray DataArrays in Cedalion can be indexed "as usual". For complete documentation, visit the [xarray documentation page](https://docs.xarray.dev/en/latest/user-guide/indexing.html). A brief visual overview:


![DataArray Indexing Overview](/img/recording/dataarray_indexing_overview.png)


Below we give some examples.

In [None]:
# first we pull out a time series to save time in the following
ts = rec["amp"]
display(ts)

It usually helps to know the array's **coordinates**. These can be viewed via the `.coords` property. Note that multiple coordinates can share a dimension. For instance, along the "time" dimension we can use "time" in seconds or "samples" in integer values. Along the "channel" dimension we can index via source-detector pairs (e.g. "S1D1") or via only the "source" or "detector". The latter gives us all matching elements, e.g. "S1" gives us all channels that contain source S1.

In [None]:
display(ts.coords)

Knowing the coordinates, we can also access the items / labels on the coordinate axes directly:

In [None]:
display(ts.wavelength) # wavelength dimension

display(ts.time) # time dimension
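
To experiment without loading a dataset, a small synthetic array with the same layout can be built by hand. This is a sketch: the array `toy` and its values are made up. Only the dimension and coordinate names follow the time series shown above.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5),
    dims=("channel", "wavelength", "time"),
    coords={
        # dimension coordinates: these get an index automatically
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "wavelength": [760.0, 850.0],
        "time": np.arange(5) * 0.5,
        # additional coordinates that share a dimension (no index by default)
        "source": ("channel", ["S1", "S1", "S2", "S2"]),
        "detector": ("channel", ["D1", "D2", "D1", "D2"]),
        "samples": ("time", np.arange(5)),
    },
)

print(toy.coords)            # all coordinates
print(sorted(toy.indexes))   # only the coordinates that have an index
```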

#### Direct Bracket Indexing
... works as expected

In [None]:
ts[:,0,:] # first item along wavelength

In [None]:
ts[:,:,::3000] # every 3000th time point
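
The same positional indexing on a small synthetic array (the array `toy` is made up for illustration) shows what happens to dimensions and coordinates:

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5),
    dims=("channel", "wavelength", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "wavelength": [760.0, 850.0],
        "time": np.arange(5) * 0.5,
    },
)

first_wl = toy[:, 0, :]        # integer index: the wavelength dimension is dropped
every_other = toy[:, :, ::2]   # slice with step: every second time point

# isel is the named equivalent of positional bracket indexing
same_as_isel = first_wl.equals(toy.isel(wavelength=0))
print(first_wl.dims, every_other.time.values, same_as_isel)
```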

#### Indexing by Label: .loc and .sel accessors
Without naming the dimensions, we need to know the order of dimensions in the DataArray...

In [None]:
ts.loc["S1D1", 760, :] # time series for channel S1D1 and wavelength 760nm

... or we are more explicit, in which case the order does not matter. `.sel` relies on an index. For some coordinates (time, channel, wavelength) indexes are built. They are printed in bold face when the DataArray is displayed. Indexes are needed for efficient lookup, but a coordinate does not strictly need one. Hence, we don't build them for every coordinate by default.

In [None]:
ts.sel(channel="S1D1", wavelength=760)  # the same time series as above

`.sel` accepts dictionaries. This is useful when the dimension name is stored in a variable:

In [None]:
dim = 'wavelength'
dim_value = 760
ts.sel({dim : dim_value})
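
That the three label-based spellings are interchangeable can be checked on a small synthetic array (the array `toy` is made up for illustration):

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5),
    dims=("channel", "wavelength", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "wavelength": [760.0, 850.0],
        "time": np.arange(5) * 0.5,
    },
)

# .loc: labels must follow the dimension order (channel, wavelength, time)
by_position = toy.loc["S1D1", 760, :]

# .sel: dimensions are named, so their order is irrelevant
by_name = toy.sel(wavelength=760, channel="S1D1")

# dictionary form: handy when the dimension name is held in a variable
dim = "wavelength"
by_dict = toy.sel({dim: 760, "channel": "S1D1"})

print(by_position.equals(by_name), by_name.equals(by_dict))
```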

#### Indexing using logical operations
We can, for instance, choose only those data points that come after t=10s and before t=60s:

In [None]:
ts.sel(time= (ts.time  > 10 ) & (ts.time < 60.))

Boolean masking also works with the .loc accessor:

In [None]:
ts.loc[ts.source == "S1"]
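
Both forms of Boolean selection can be tried on a small synthetic array (the array `toy` and its values are made up for illustration):

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5),
    dims=("channel", "wavelength", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "wavelength": [760.0, 850.0],
        "time": np.arange(5) * 0.5,
        "source": ("channel", ["S1", "S1", "S2", "S2"]),
    },
)

# Boolean condition on the time coordinate, combined with "&"
window = toy.sel(time=(toy.time > 0.4) & (toy.time < 1.6))

# Boolean mask built from a non-index coordinate; .loc applies it to the
# first dimension ("channel")
s1_channels = toy.loc[toy.source == "S1"]

print(window.time.values, s1_channels.channel.values)
```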

#### Indexing using string matching or "isin"
First via the string accessor:

In [None]:
# regular expression via str accessor
# (no commas inside the character classes: "[2,3]" would also match a literal ",")
ts.sel(channel=ts.channel.str.match("S[23]D[12]"))

Or via `isin` to select a fixed item or list of items:

In [None]:
# item
ts.sel(channel="S1D1")

# list of items
ts.sel(channel=ts.channel.isin(["S1D1", "S8D8"]))
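
Both selection styles produce a Boolean array along "channel" that is then passed to `.sel`. A small synthetic example (the array `toy` is made up for illustration) makes the intermediate masks visible:

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 3, dtype=float).reshape(4, 3),
    dims=("channel", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "time": [0.0, 0.5, 1.0],
    },
)

# regular expression: source 1 or 2, detector 1
regex_mask = toy.channel.str.match("S[12]D1")
by_regex = toy.sel(channel=regex_mask)

# explicit list of channel names
isin_mask = toy.channel.isin(["S1D1", "S2D2"])
by_list = toy.sel(channel=isin_mask)

print(regex_mask.values, by_regex.channel.values)
print(isin_mask.values, by_list.channel.values)
```

Note that `str.match` anchors the pattern only at the start of the string, so a pattern such as "S1D1" would also match a channel named "S1D10".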

#### Building indices if they are not available
As noted above, `.sel` relies on an index. For some coordinates (time, channel, wavelength) indexes are built. They are printed in bold face when the DataArray is displayed. Indexes are needed for efficient lookup, but a coordinate does not strictly need one. If we would like to index via a coordinate for which no index is available (here the "source" coordinate), one can [be built](https://docs.xarray.dev/en/v2024.07.0/generated/xarray.DataArray.set_xindex.html#xarray.DataArray.set_xindex):

In [None]:
# build the index
ts_with_index = ts.set_xindex("source")
# now we can select by source index
ts_with_index.sel(source="S1")
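
The following sketch repeats this on a small synthetic array (the array `toy` is made up for illustration) and shows that the new index lives alongside the existing "channel" index. It assumes an xarray version that provides `set_xindex`.

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 3, dtype=float).reshape(4, 3),
    dims=("channel", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "time": [0.0, 0.5, 1.0],
        "source": ("channel", ["S1", "S1", "S2", "S2"]),  # no index yet
    },
)

indexes_before = sorted(toy.indexes)

# set_xindex returns a new DataArray; the original is left unchanged
toy_with_index = toy.set_xindex("source")
indexes_after = sorted(toy_with_index.indexes)

# all channels whose source is S1
s1 = toy_with_index.sel(source="S1")
print(indexes_before, indexes_after, s1.channel.values)
```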

#### Using coordinates from one array to index another
Here we use `ts.source` to select values in `geo3d` along the 'label' dimension. Because `ts.source` belongs to the 'channel' dimension of `ts`, the resulting `xr.DataArray` has the dimensions 'channel' (from `ts.source`) and 'digitized' (from `geo3d`).

In [None]:
display(rec.geo3d)
display(ts.source)
rec.geo3d.loc[ts.source]
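
This kind of lookup is what makes per-channel quantities such as source-detector distances easy to compute. The sketch below uses made-up point positions and plain strings for the "type" coordinate (instead of `PointType` values). `toy_geo3d`, `source` and `detector` are illustrative names.

```python
import numpy as np
import xarray as xr

# labeled points: one row per optode, one column per spatial axis
toy_geo3d = xr.DataArray(
    [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [3.0, 4.0, 0.0], [10.0, 12.0, 0.0]],
    dims=("label", "digitized"),
    coords={
        "label": ["S1", "S2", "D1", "D2"],
        "type": ("label", ["source", "source", "detector", "detector"]),
    },
)

# per-channel source and detector labels, as found on a time series
channels = ["S1D1", "S2D2"]
source = xr.DataArray(["S1", "S2"], dims="channel", coords={"channel": channels})
detector = xr.DataArray(["D1", "D2"], dims="channel", coords={"channel": channels})

# indexing with a DataArray replaces the "label" dimension by "channel"
src_pos = toy_geo3d.loc[source]
det_pos = toy_geo3d.loc[detector]

# Euclidean source-detector distance per channel
dist = np.sqrt(((src_pos - det_pos) ** 2).sum("digitized"))
print(src_pos.dims, dist.values)
```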

### Accessing xarray DataArray values with .values
This is useful, for example, to write them to a numpy array. Example: we want to pull out the actual source names of the first 3 sources.

In [None]:
# this way of indexing will not give us what we want, as it returns another xarray with coordinates etc.
display(ts.source[:3])

# instead we use the .values accessor:
display(ts.source.values[:3])

### Accessing single items in an xarray with .item
Indexing a single item in an xarray still returns an xarray with coordinates:

In [None]:
display(ts[0,0,0]) # the first time point of the first channel and first wavelength in the DataArray

In [None]:
# to get just the item we use .item()
display(ts[0,0,0].item())

In [None]:
# of course this also works with the .sel method
# (numeric labels are passed as numbers, matching the dtype of the coordinates)
ts.sel(channel="S1D1", wavelength=760, time=0.0).item()
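
The difference between a DataArray, its `.values` and a single `.item()` can be checked on a small synthetic array (the array `toy` is made up for illustration):

```python
import numpy as np
import xarray as xr

toy = xr.DataArray(
    np.arange(4 * 2 * 5, dtype=float).reshape(4, 2, 5),
    dims=("channel", "wavelength", "time"),
    coords={
        "channel": ["S1D1", "S1D2", "S2D1", "S2D2"],
        "wavelength": [760.0, 850.0],
        "time": np.arange(5) * 0.5,
        "source": ("channel", ["S1", "S1", "S2", "S2"]),
    },
)

as_dataarray = toy.source[:2]     # still a DataArray with coordinates
as_numpy = toy.source.values[:2]  # plain numpy array of labels

zero_dim = toy[0, 0, 0]           # 0-dimensional DataArray
as_scalar = zero_dim.item()       # plain Python float

# the same element, selected by label
by_label = toy.sel(channel="S1D1", wavelength=760, time=0.0).item()

print(type(as_dataarray).__name__, type(as_numpy).__name__, as_scalar, by_label)
```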