<img src='images/scipp-logo.png' width="400" height="400" style="display: block; margin-left: auto; margin-right: auto; width: 640px;">

<div style="text-align: center;">Multi-dimensional data arrays with labeled dimensions</div>
<br>
<br>

**Jan-Lukas Wynen** (<i class="fa-solid fa-envelope"></i> jan-lukas.wynen@ess.eu)
<br>

https://scipp.github.io/

# Data Reduction at ESS

<img src='images/software-stack.svg' style="display: block; margin-left: auto; margin-right: auto; width: 75%;">

### The problem with numpy (1)

Which dimension is which?

In [None]:
import numpy as np
a = np.array([[1, 2], [3, 4]])
a

In [None]:
a[0]

Or like this?

In [None]:
a[:, 0]

### Scipp's solution: Labeled dimensions

In [None]:
import scipp as sc
v = sc.array(dims=['x', 'y'], values=[[1, 2], [3, 4]])
v

In [None]:
v['x', 0]

### The problem with numpy (2)

How are arrays associated?

In [None]:
time = np.array([0, 1, 2, 3])
speed = np.array([0.1, 0.5, 1.3, 0.7])

Do both come from the same measurement?

### Scipp's solution: Data Arrays

In [None]:
da = sc.DataArray(sc.array(dims=['time'], values=speed, unit='m/s'),
                  coords={'time': sc.array(dims=['time'], values=time, unit='min')})
sc.table(da)

### Physical units prevent mistakes

In [None]:
da

Total distance:

In [None]:
sc.sum(da.data * da.coords['time'])

### Wait, isn't this just xarray?

Sort of, but scipp has

- built-in physical units (xarray: via pint)
- variances
- non-destructive masks
- bin-edge coordinates
- binned data

### What about pandas?

`scipp.DataArray` is similar to `pandas.DataFrame`, but multi-dimensional

In [None]:
da = sc.DataArray(sc.array(dims=['x', 'y'], values=np.random.normal(size=[5, 10])),
                  coords={'x': sc.arange('x', 5),
                          'y': sc.arange('y', 4, 14)})
da

In [None]:
sc.show(da)

### Coordinates prevent mistakes

In [None]:
da

In [None]:
da2 = sc.DataArray(sc.array(dims=['x'], values=np.ones(5)),
                  coords={'x': sc.arange('x', 1, 6)})
da + da2

### Attributes: unchecked coordinates

In [None]:
da_attr = da.copy()
da_attr.attrs['x'] = da_attr.coords.pop('x')
da_attr

In [None]:
da_attr + da2

### Masks: Ignore elements without removing them 

In [None]:
masked = da.copy()
masked.masks['m'] = sc.array(dims=['x'], values=[True, False, False, True, False])
masked

In [None]:
da.sum()

In [None]:
masked.sum()

### Plotting

In [None]:
masked.plot()

# Questions?

Empty cell to show stuff

## Binned Data

- main features
    - labeled dims
    - units
    - variances
    - bin-edges
    - binned data
    - non-destructive masks
- numpy confusing
- xarray good but missing some stuff (but more mature and better dask support)
- variables: arrays with benefits
    - html output
    - indexing: positional with dim-label
    - checks for units
    - how to make (array / scalar / arange)
- data arrays: like dataframe / table but multi-dim
    - data with coord
    - html + table + show
    - plot
    - slicing: positional + label-based
    - computation: compares coords, acts on data
    - also attrs: like coord but dropped if mismatch
    - also masks: non-destructive
- datasets: multiple data arrays with shared coords
- binned data
    - event list, want new dim (e.g. time) but inhomogeneous layout
    - binned data: kind of like data array of data arrays
    - binning != histogramming
        - .bins.sum()
        - sc.histogram
    - can change binning
    - bin by edges or by groups