# Data structures

The main data structures are:

- `Array`: can be thought of as a numpy `ndarray` with a physical unit
- `Vector`: a container whose components are Arrays
- `Datagroup`: a dictionary of Arrays and/or Vectors
- `Dataset`: a dictionary of Datagroups, and can also contain additional metadata

This notebook describes each structure
and shows how it can be used to manipulate and explore your simulation data efficiently.
The structures aim to have an API which is close to that of the Python `dict` and the numpy `ndarray`.

In [None]:
import osyris
import numpy as np

We load a data output from a star formation simulation:

In [None]:
data = osyris.RamsesDataset(8, path="osyrisdata/starformation").load()

## The `Dataset` class

The `Dataset` object has a `__str__` representation,
which lists all the contents of `data` in a readable form:

In [None]:
data

The `Dataset` class aims to behave very similarly to a Python `dict`.
To access one element of `data`, we index it using the group name:

In [None]:
data["mesh"]

Each entry in the `Dataset` is a `Datagroup`.

It is also possible to store additional metadata under the `meta` property:

In [None]:
data.meta

## The `Datagroup` class

The `Datagroup` can be thought of as a Python `dict` of arrays,
which enforces shape alignment, so that all members of a `Datagroup` have the same length.

The elements of a `Datagroup` can be accessed using the variable names:

In [None]:
mesh = data["mesh"]
mesh["density"]

Each entry in the `Datagroup` is an `Array` object, which will be described in detail below.

`Dataset` and `Datagroup` both provide standard dictionary iterators,
which can be used to loop over their contents:

In [None]:
data.keys()

In [None]:
mesh.keys()

Because it ensures all its members have the same length,
a `Datagroup` can be sliced along its length:

In [None]:
mesh[10:20]
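
The two properties described above (equal-length members and slicing along the length) can be illustrated with a small stand-alone sketch. The `AlignedGroup` class below is hypothetical and is not how osyris implements `Datagroup`; it only mimics the behaviour with plain numpy arrays and made-up data.

```python
import numpy as np


class AlignedGroup(dict):
    """Hypothetical stand-in for a Datagroup: a dict of equal-length arrays."""

    def __setitem__(self, key, value):
        value = np.asarray(value)
        # Reject any new member whose length differs from the existing ones.
        for other in self.values():
            if len(other) != len(value):
                raise ValueError(
                    f"length mismatch: {key!r} has {len(value)}, expected {len(other)}"
                )
        super().__setitem__(key, value)

    def __getitem__(self, key):
        # A string selects one member; a slice is applied to every member.
        if isinstance(key, slice):
            out = AlignedGroup()
            for name, arr in self.items():
                out[name] = arr[key]
            return out
        return super().__getitem__(key)


group = AlignedGroup()
group["density"] = np.linspace(1.0, 2.0, 100)   # made-up values
group["level"] = np.arange(100)

sub = group[10:20]
print({name: len(arr) for name, arr in sub.items()})
```

Because every member has the same length, a single slice can be forwarded to each array and the result is again a consistent group.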

## The `Array` class

Each entry in the `Datagroup` dictionary is an `Array` (or `Vector`, see below) object:

In [None]:
a = mesh["density"]
type(a)

Its string representation lists the array's name,
the minimum and maximum values in the array, its physical unit, and the number of elements in the array.

In [None]:
a

An `Array` can be thought of as a `numpy` array with a physical unit:

In [None]:
a.values

In [None]:
a.unit

Operations you would normally perform on a numpy array, such as slicing, are supported:

In [None]:
a[101:255]

Note that this returns a view onto the original data, instead of making a copy.
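
The difference between a view and a copy matters when the sliced data is modified. The sketch below uses a plain numpy array rather than an `Array`, so it shows numpy's own slicing semantics; that an `Array` slice behaves the same way is taken from the statement above and is not verified here.

```python
import numpy as np

values = np.arange(10.0)
window = values[2:5]                 # basic slicing returns a view, not a copy

print(np.shares_memory(values, window))

window[0] = -1.0                     # writing through the view...
print(values[2])                     # ...also changes the original array

independent = values[2:5].copy()     # an explicit copy is decoupled
independent[0] = 99.0
print(values[2])                     # the original is left untouched
```

If a slice is going to be modified and the original must stay intact, making an explicit copy first avoids surprises.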

Using `numpy`'s `__array_function__` and `__array_ufunc__` protocols, `Array` also supports most `numpy` operations, e.g.

In [None]:
np.log10(a)

In [None]:
np.sum(a)

Note that in these cases a new `Array` is returned. It is not attached to any `Datagroup`,
and therefore does not have a name.
To insert a new `Array` into a `Datagroup`, simply use the dictionary syntax:

In [None]:
mesh["log_rho"] = np.log10(a)
mesh
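
To give an idea of how a wrapper class can take part in numpy operations through the `__array_ufunc__` protocol, here is a minimal sketch. The `Tagged` class and its `label` attribute are hypothetical and much simpler than the real `Array`: there is no unit handling, and only direct ufunc calls such as `np.log10` are covered (functions like `np.sum` go through other dispatch paths).

```python
import numpy as np


class Tagged:
    """Hypothetical wrapper: a numpy array plus a free-form label."""

    def __init__(self, values, label=""):
        self.values = np.asarray(values)
        self.label = label

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap any Tagged inputs, apply the ufunc to the raw arrays, re-wrap.
        raw = [x.values if isinstance(x, Tagged) else x for x in inputs]
        return Tagged(ufunc(*raw, **kwargs), label=f"{ufunc.__name__}({self.label})")


rho = Tagged([1.0, 10.0, 100.0], label="density")   # made-up values
log_rho = np.log10(rho)
print(type(log_rho).__name__, log_rho.label, log_rho.values)
```

The result is a new wrapper object rather than a bare `ndarray`, which is the same pattern as `np.log10(a)` returning a new `Array`.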

### Example: find the system center

A simple way to find the center of our protostellar system is to use the coordinates of the cell with the highest density in the simulation:

In [None]:
ind = np.argmax(mesh["density"])
center = mesh["position"][ind]
center

## Array arithmetic and units

Units are handled automatically (and conversions carried out) when performing arithmetic on arrays.
For instance, suppose we want to compute a new quantity which represents the mass inside each cell.

The density is in `g / cm**3` and the cell size is in `cm`, so multiplying the density by the cell volume gives a mass in `g`:

In [None]:
mesh["mass"] = mesh["density"] * (mesh["dx"] ** 3)
mesh["mass"]

This helps to free mental capacity and allows the user to focus on what is important: **doing science**.

### Manual unit conversions

Sometimes, it is useful to convert from CGS units to other units:

In [None]:
mesh["mass"].to("M_sun")

Note that in this case a new `Array` is returned. If you want to update the entry in your `Datagroup`, use:

In [None]:
mesh["mass"] = mesh["mass"].to("M_sun")

Units are properly handled in operations that involve non-base units by first converting both operands to their base units,
before performing the operation.
In the following example, we compute the mass from a `density` in `g / cm**3` and a cell size in `au`:

In [None]:
dx_in_au = mesh["dx"].to("au")
(mesh["density"] * (dx_in_au**3)).to("M_sun")
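
The conversion to base units can be followed by hand with plain floats. The sketch below does not use osyris; the density and cell size are made-up values, the astronomical unit is the exact IAU value expressed in cm, and the solar mass is an approximate value assumed for illustration.

```python
AU_IN_CM = 1.495978707e13      # exact: 1 au = 149 597 870 700 m
M_SUN_IN_G = 1.989e33          # approximate solar mass in grams (assumed value)

density = 1.0e-13              # g / cm**3 (made-up value)
dx_cm = 2.0e14                 # cell size in cm (made-up value)

# Route 1: everything is already in CGS.
mass_g = density * dx_cm**3

# Route 2: the cell size is given in au, so convert it back to cm
# before multiplying, as a unit-aware library would do internally.
dx_au = dx_cm / AU_IN_CM
mass_g_via_au = density * (dx_au * AU_IN_CM) ** 3

print(mass_g / M_SUN_IN_G, mass_g_via_au / M_SUN_IN_G)
```

Both routes give the same mass up to floating-point rounding, which is the property the automatic unit handling is meant to guarantee.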

### Units also provide safety

Physical units also provide a certain level of safety around operations.
When quantities are assigned to intermediate variables,
it is easy to lose track of the exact quantity (or at least its dimensionality) that a variable represents.

Physical units can prevent performing operations on mismatching quantities:

In [None]:
try:
    mesh["density"] + mesh["mass"]
except Exception as e:
    print(e)

Physical units can also help find errors in an analysis workflow:
for example, when looking at the final value of a computed star formation rate
and realising that its unit represents a quantity per unit volume,
while the user was trying to calculate an integrated rate.
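
The kind of check that makes this safety possible can be sketched with a toy class that tracks dimension exponents. The `Quantity` class below is hypothetical and unrelated to the unit machinery osyris actually uses; it only shows why adding a density to a mass can be rejected while multiplying a density by a volume is allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Hypothetical scalar with dimension exponents (mass, length, time)."""

    value: float
    dims: tuple

    def __add__(self, other):
        # Addition only makes sense for identical dimensions.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplication adds the exponents of each base dimension.
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)


density = Quantity(2.0, (1, -3, 0))   # mass / length**3
volume = Quantity(4.0, (0, 3, 0))     # length**3
mass = density * volume               # dimensions become (1, 0, 0)

try:
    density + mass
except TypeError as err:
    print(err)
```

Inspecting `mass.dims` at the end of a calculation plays the same role as reading the unit of a final result: an unexpected exponent points to a mistake earlier in the workflow.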

### Operations with floats

Operations with floats are also supported:

In [None]:
mesh["density"] * 1.0e5

## The `Vector` class

Datagroups contain scalar variables (such as the gas density above) as well as vector variables.
Vector variables have more than one component and are represented by a special `Vector` class,
which contains one `Array` for each component.

A Vector variable can be identified in a Datagroup by the list of components at the end of its string representation `{x,y,z}`:

In [None]:
mesh["velocity"]

The Min and Max values printed above are from the norm of the Vector.
The components of the Vector are accessed via the `.x`, `.y`, `.z` properties:

In [None]:
mesh["velocity"].y

The shape of a Vector is the same as the shape of a scalar Array in the same Datagroup:

In [None]:
mesh["velocity"].shape == mesh["density"].shape
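
The relationship between components, norm and shape can be reproduced with plain numpy arrays. This is only an illustration with made-up numbers, assuming one array per component as described above; it does not use the `Vector` class itself.

```python
import numpy as np

# Made-up velocity components for 4 cells, one array per component.
vx = np.array([3.0, 0.0, 1.0, 2.0])
vy = np.array([4.0, 0.0, 2.0, 0.0])
vz = np.array([0.0, 5.0, 2.0, 0.0])

# The norm has one value per cell, so its shape matches each component.
norm = np.sqrt(vx**2 + vy**2 + vz**2)
print(norm.shape == vx.shape)
print(norm.min(), norm.max())
```

Storing the components separately means the number of cells, not the number of components, defines the shape.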

### Vector operations

All operations that can be done on Arrays (including Numpy operations) can also be carried out with Vectors:

In [None]:
mesh["velocity"] + mesh["velocity"]

In [None]:
np.sum(mesh["velocity"])

### Vector - Array operations

In operations that combine Vectors and Arrays, the Arrays are automatically broadcast over the components of the Vector:

In [None]:
mesh["momentum"] = mesh["density"] * mesh["velocity"]
mesh["momentum"]
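
For comparison, this is what the same product looks like with bare numpy arrays, where the broadcasting has to be spelled out by hand. The data is made up, and the `(N, 3)` layout is an assumption chosen for the sketch rather than the internal layout of a `Vector`.

```python
import numpy as np

density = np.array([1.0, 2.0, 3.0, 4.0])          # shape (4,)
velocity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [0.0, 0.0, 1.0]])             # shape (4, 3): one row per cell

# density * velocity would fail here: shapes (4,) and (4, 3) do not broadcast.
# Adding a trailing axis lets each scalar multiply all components of its row.
momentum = density[:, np.newaxis] * velocity
print(momentum)
```

Keeping the scalar and vector quantities in dedicated classes removes the need for this manual reshaping.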

### Dot and cross products

Vectors also provide additional functionality, such as `dot` and `cross` products:

In [None]:
mesh["velocity"].dot(mesh["velocity"])

In [None]:
mesh["position"].cross(mesh["velocity"])
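
As a reference for what these products compute per cell, here is the equivalent calculation with plain numpy arrays. The positions and velocities are made up and stored as rows of an `(N, 3)` array, which is an assumption of this sketch only.

```python
import numpy as np

position = np.array([[1.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0]])
velocity = np.array([[0.0, 3.0, 0.0],
                     [0.0, 0.0, 1.0]])

# Dot product of the velocity with itself: one scalar (speed squared) per cell.
speed_squared = np.sum(velocity * velocity, axis=1)

# Cross product of position and velocity: one 3-vector per cell.
r_cross_v = np.cross(position, velocity)

print(speed_squared)
print(r_cross_v)
```

The dot product reduces each pair of vectors to a scalar, while the cross product yields another vector, for instance the specific angular momentum when applied to position and velocity.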

## Compatibility with Pandas

It is possible to convert a `Datagroup` to a Pandas `DataFrame`, using the `to_pandas` method.

**Note that the physical units are lost** in the conversion, so take care when using this!

In [None]:
df = mesh.to_pandas()
df

You can then use some useful features of Pandas, such as `groupby`, to go further in your data analysis:

In [None]:
df.groupby("level").sum()
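
To see what the grouped sum produces without loading a simulation, the same call can be tried on a small hand-made `DataFrame`. The column names and values below are made up; since units are lost in the conversion, the sketch simply assumes that all masses share one unit.

```python
import pandas as pd

# Made-up table: refinement level and mass of five cells.
df_toy = pd.DataFrame({
    "level": [5, 5, 6, 6, 6],
    "mass": [1.0, 2.0, 0.5, 0.5, 1.0],
})

# One row per refinement level, with the masses of its cells added up.
per_level = df_toy.groupby("level").sum()
print(per_level)
```

The resulting table is indexed by `level`, so `per_level.loc[6, "mass"]` gives the total mass of the cells at level 6.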