# Data structures

The main data structures are:

- the `Array` object, which can be thought of as a numpy `ndarray` with a physical unit
- the `Datagroup` object, which acts as a dictionary of Arrays
- the `Dataset` object, which acts as a dictionary of Datagroups, and can also contain additional metadata

This notebook describes each structure
and shows how they can be used to manipulate and explore your simulation data efficiently.
Their APIs aim to stay close to those of the Python `dict` and the numpy `ndarray`.

In [None]:
import osyris
import numpy as np

We load a data output from a star formation simulation:

In [None]:
data = osyris.Dataset(8, scale="au", path="osyrisdata/starformation").load()

## The `Dataset` class

The `Dataset` object has a `__str__` representation,
which lists all the contents of `data` in an easy-to-read manner:

In [None]:
data

The `Dataset` class aims to behave very similarly to a Python `dict`.
To access one element of `data`, we index it using the group name:

In [None]:
data["hydro"]

Each entry in the `Dataset` is a `Datagroup`.

It is also possible to store additional metadata under the `meta` property:

In [None]:
data.meta

## The `Datagroup` class

The `Datagroup` can be thought of as a Python `dict` of arrays,
which enforces shape alignment, so that all members of a `Datagroup` have the same length.

The elements of a `Datagroup` can be accessed using the variable names:

In [None]:
data["hydro"]["density"]

Each entry in the `Datagroup` is an `Array` object, which will be described in detail below.

`Dataset` and `Datagroup` both provide standard dictionary iterators,
which can be used to loop over their contents:

In [None]:
data.keys()

In [None]:
data["amr"].values()

Because it ensures all its members have the same length,
a `Datagroup` can be sliced along its length:

In [None]:
data["hydro"][10:20]
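The alignment and slicing behaviour described above can be sketched in plain Python. The `AlignedGroup` class below is a hypothetical stand-in written for illustration only; it is not the osyris `Datagroup` implementation, and it stores plain lists rather than `Array` objects.

```python
class AlignedGroup:
    """Hypothetical dict-like container whose members must share one length.

    Illustration only; this is not the osyris Datagroup implementation.
    """

    def __init__(self):
        self._items = {}

    def __setitem__(self, key, values):
        values = list(values)
        # Compare against the other members (replacing a key is allowed).
        others = [v for k, v in self._items.items() if k != key]
        if others and len(values) != len(others[0]):
            raise ValueError(
                f"Size mismatch for '{key}': got {len(values)}, "
                f"expected {len(others[0])}"
            )
        self._items[key] = values

    def __getitem__(self, key):
        # A slice selects the same rows from every member;
        # any other key selects a single member by name.
        if isinstance(key, slice):
            out = AlignedGroup()
            for name, values in self._items.items():
                out[name] = values[key]
            return out
        return self._items[key]

    def keys(self):
        return self._items.keys()

    def __len__(self):
        # The length of the group is the shared length of its members.
        return len(next(iter(self._items.values()))) if self._items else 0


group = AlignedGroup()
group["density"] = [1.0, 2.0, 3.0, 4.0]
group["temperature"] = [10.0, 20.0, 30.0, 40.0]

# Slicing the group slices every member consistently.
subset = group[1:3]
```

Because every member has the same length, a single slice can be applied to all of them without any risk of the members falling out of step.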

## The `Array` object

Each entry in the `Datagroup` dictionary is an `Array` object:

In [None]:
a = data["hydro"]["density"]
type(a)

Its string representation lists the array's key in its parent `Datagroup`,
the minimum and maximum values in the array, its physical unit, and the number of elements in the array.

In [None]:
a

An `Array` can be thought of as a `numpy` array with a physical unit:

In [None]:
a.values

In [None]:
a.unit

Operations you would normally perform on a numpy array, such as slicing, are supported:

In [None]:
a[101:255]

Note that this returns a view onto the original data, instead of making a copy.
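The difference between a view and a copy can be illustrated with a plain numpy array. This sketch uses made-up values and only numpy itself; it assumes that an `Array` slice behaves analogously, as the note above suggests.

```python
import numpy as np

values = np.arange(10.0)

# Basic slicing returns a view that shares memory with the original.
window = values[2:5]
shares = np.shares_memory(values, window)

# Writing through the view also changes the original array.
window[0] = -1.0

# An explicit copy is decoupled from the original.
independent = values[2:5].copy()
independent[1] = 99.0
```

In practice this means that modifying a slice in place should be done with care, and `.copy()` is the safe choice on a numpy array when the original data must stay untouched.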

Using `numpy`'s `__array_function__` and `__array_ufunc__` protocols, `Array` also supports most `numpy` operations, e.g.

In [None]:
np.log10(a)

In [None]:
np.sum(a)

Note that in these cases, a new `Array` is returned. It is not attached to any `Datagroup`,
and therefore does not have a name.
To insert a new `Array` into a `Datagroup`, simply use the dictionary syntax:

In [None]:
data["hydro"]["log_rho"] = np.log10(a)
data["hydro"]
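For readers curious how numpy's `__array_ufunc__` protocol lets a class that is not an `ndarray` respond to calls such as `np.log10`, here is a minimal sketch. `TaggedArray` is a hypothetical class invented for this example, not the osyris `Array`, and its unit handling is deliberately naive: the unit string is simply carried over, which is only correct for ufuncs that preserve units, such as `np.absolute`.

```python
import numpy as np


class TaggedArray:
    """Hypothetical array-with-unit wrapper; not the osyris implementation."""

    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=float)
        self.unit = unit

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Only plain calls such as np.absolute(x) are handled in this sketch.
        if method != "__call__":
            return NotImplemented
        # Unwrap TaggedArray inputs so the ufunc runs on raw ndarrays.
        raw = [x.values if isinstance(x, TaggedArray) else x for x in inputs]
        result = ufunc(*raw, **kwargs)
        # Naive assumption: the result keeps the unit of this operand.
        return TaggedArray(result, self.unit)


speed = TaggedArray([-1.0, 2.0, -3.0], "cm/s")
magnitude = np.absolute(speed)  # dispatched to TaggedArray.__array_ufunc__
```

A real implementation has to work out the resulting unit for each ufunc (for example, a square root changes the unit), which is the part this sketch leaves out.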

### Example: find the system center

A simple way to find the center of our protostellar system is to use the coordinates of the cell with the highest density in the simulation.

In [None]:
ind = np.argmax(data["hydro"]["density"])
center = data["amr"]["position"][ind.values]
center
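The same recipe can be tried on a small synthetic example using plain numpy arrays. The density and position values below are made up for illustration, and the final distance computation is an extra step that is not part of the cell above.

```python
import numpy as np

# Made-up densities for four cells, and their (x, y, z) positions.
density = np.array([1.0e-18, 3.0e-13, 5.0e-16, 2.0e-15])
position = np.array([
    [10.0, 0.0, 0.0],
    [0.5, -0.2, 0.1],
    [-4.0, 2.0, 1.0],
    [3.0, 3.0, -3.0],
])

# Index of the densest cell, and the position of that cell.
ind = np.argmax(density)
center = position[ind]

# Distance of every cell from the chosen center.
offsets = position - center
distance = np.sqrt(np.sum(offsets**2, axis=1))
```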

## Array arithmetic and units

Units are automatically handled (and conversions carried out) when performing arithmetic on arrays.
For instance, suppose we want to compute a new quantity which represents the mass inside each cell.

The density data is in `g / cm**3`

In [None]:
data["hydro"]["density"]

while the cell size is in astronomical units

In [None]:
data["amr"]["dx"]

However, we can still multiply them together:

In [None]:
data["hydro"]["mass"] = data["hydro"]["density"] * (data["amr"]["dx"]**3)
data["hydro"]["mass"]

The conversion between `au` and `cm` is automatically handled by first converting both operands to their base units, before performing the operation.
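To make the bookkeeping explicit, here is the same calculation done by hand for a single cell in plain Python. The density and cell size are made-up numbers, and the variable names are chosen for this example only.

```python
# 1 au expressed in cm (the au is defined as exactly 149 597 870 700 m).
AU_IN_CM = 1.495978707e13

density_cgs = 1.0e-13  # g / cm**3 (made-up value)
dx_au = 2.0            # cell size in au (made-up value)

# Convert the cell size to cm before cubing it, so that the volume
# is in cm**3 and matches the unit of the density.
dx_cm = dx_au * AU_IN_CM
mass_g = density_cgs * dx_cm**3  # result in grams
```

Forgetting the conversion here would give a mass that is wrong by a factor of `AU_IN_CM**3`, which is exactly the kind of mistake that automatic unit handling avoids.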

This helps to free mental capacity and allows the user to focus on what is important: **doing science**.

### Manual unit conversions

Sometimes, it is useful to convert from CGS units to other units:

In [None]:
data["hydro"]["mass"].to("msun")
data["hydro"]["mass"]
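Underneath, such a conversion amounts to a division by a constant factor. A minimal sketch in plain Python, assuming an approximate solar mass of `1.98847e33` g and a made-up input mass:

```python
# Approximate solar mass in grams (assumed value for this illustration).
MSUN_IN_G = 1.98847e33

mass_g = 3.0e33  # made-up mass in grams

# Converting from grams to solar masses is a division by the factor above.
mass_msun = mass_g / MSUN_IN_G
```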

### Units also provide safety

Physical units also provide a certain level of safety around operations.
By assigning some quantities to intermediate variables,
it is often easy to lose track of the exact quantity (or at least its dimensionality) that a variable represents.

Physical units can prevent performing operations on mismatching quantities:

In [None]:
try:
    data["hydro"]["density"] + data["hydro"]["mass"]
except Exception as e:
    print(e)
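The idea behind this check can be sketched with a small hypothetical `Quantity` class. It is not the osyris `Array`: it compares unit strings literally and performs no conversion between compatible units, which a real unit library would do.

```python
class Quantity:
    """Hypothetical value-with-unit pair; not the osyris Array."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def __add__(self, other):
        # Simplification: units must match exactly, no conversion is attempted.
        if self.unit != other.unit:
            raise TypeError(f"Cannot add '{self.unit}' and '{other.unit}'")
        return Quantity(self.value + other.value, self.unit)


density = Quantity(1.0e-13, "g/cm**3")
mass = Quantity(2.0e27, "g")

try:
    density + mass
    message = None
except TypeError as err:
    message = str(err)

# Adding two quantities with the same unit works as expected.
total = mass + mass
```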

Physical units can also help find errors in an analysis workflow.
For example, when looking at the final value of a computed star formation rate,
the user may realise that the final unit represents a quantity per unit volume,
whereas they were trying to calculate an integrated rate.

### Automatic broadcast

Arrays can either represent a scalar quantity (e.g. density) or a vector quantity (e.g. velocity).
When performing arithmetic that involves both scalars and vectors, an automatic broadcast mechanism handles the operations:

In [None]:
data["hydro"]["momentum"] = data["hydro"]["density"] * data["hydro"]["velocity"]
data["hydro"]["momentum"]
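For comparison, the equivalent operation on plain numpy arrays needs an explicit extra axis. The sketch below uses made-up values and only numpy; it does not show how osyris implements its broadcast.

```python
import numpy as np

# Three cells: one scalar density per cell, one (x, y, z) velocity per cell.
density = np.array([1.0, 2.0, 3.0])    # shape (3,)
velocity = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
])                                     # shape (3, 3)

# Add a trailing axis so each density multiplies a whole velocity vector.
momentum = density[:, np.newaxis] * velocity
```

With three cells and three components, `density * velocity` would also run without error in numpy, but it would scale columns instead of rows. An automatic broadcast that knows which axis holds the vector components removes that pitfall.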

### Operations with floats

Operations with floats are also supported:

In [None]:
data["hydro"]["density"] * 1.0e5