# Dataset

The Midgard **Dataset** can be used to store data in memory or on disk. If a **Dataset** is saved to a file on disk, the Hierarchical Data Format 5 (HDF5) is used. HDF5 is a high-performance storage format designed for storing large, multidimensional arrays of data. Data or subsets of data can be stored and accessed very efficiently with HDF5. Python supports HDF5 via the **h5py** package, which is designed to work with **NumPy**.

The **Dataset** consists of **fields**. Each field is a data array, and all fields have the same number of observations. Fields can have different datatypes. The following field datatypes can be defined:

- bool
- collection
- float
- position
- position_delta
- posvel
- posvel_delta 
- sigma
- text
- time
- time_delta

For each field datatype there is a method for adding that type of data to the **Dataset**. These methods are called **add_\< datatype \>()**, where \< datatype \> is a placeholder for one of the existing field datatypes (e.g. float or time).
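The idea of fields that all share the same number of observations, added through one method per datatype, can be sketched in plain Python. This is only an illustration under stated assumptions: `MiniDataset` and its internals are hypothetical names, not the Midgard implementation, and how Midgard itself reacts to a length mismatch is not shown here.

```python
class MiniDataset:
    """Hypothetical stand-in for illustration only, not the Midgard Dataset."""

    def __init__(self, num_obs=0):
        self.num_obs = num_obs  # number of observations every field must have
        self._fields = {}       # field name -> list of values

    @property
    def fields(self):
        """Names of the fields added so far, sorted alphabetically."""
        return sorted(self._fields)

    def _add(self, name, val, convert):
        # Convert each value to the field datatype and check the field length
        values = [convert(v) for v in val]
        if len(values) != self.num_obs:
            raise ValueError(
                f"Field '{name}' has {len(values)} observations, expected {self.num_obs}"
            )
        self._fields[name] = values

    # One add_<datatype> method per supported datatype
    def add_float(self, name, val):
        self._add(name, val, float)

    def add_text(self, name, val):
        self._add(name, val, str)

    def __getitem__(self, name):
        return self._fields[name]


dset = MiniDataset(num_obs=3)
dset.add_float(name="numbers", val=[1, 2, 3])
dset.add_text(name="satellite", val=["E01", "E02", "G01"])

# A field with the wrong number of observations is rejected in this sketch
try:
    dset.add_float(name="too_short", val=[1, 2])
except ValueError as err:
    rejected = str(err)
```

Keeping the length check in one shared helper means every `add_<datatype>` method enforces the same invariant.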

The figure below illustrates a **Dataset** with fields of different datatypes, which is held in memory as an array. The **Dataset** in memory can be saved to disk as an HDF5 file.

<center><img src="figures/dataset/dataset_overview.png" width="1000"/></center>

## Use Dataset

The following example shows how to use a Dataset. The first step is to create an instance of the Dataset class and define the size of the Dataset fields. Afterwards, fields are added by specifying the field datatype. If necessary, the Dataset can be written to a file on disk and read back again later.

In [None]:
# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset()

# Define size of Dataset arrays
dset.num_obs = 5  # also possible via dataset.Dataset(num_obs=5)

# Add float field to Dataset
dset.add_float(name='numbers', val=[1,2,3,4,5])

# Two ways to access Dataset 'numbers' field
numbers = dset.numbers
numbers = dset['numbers']

# Access element of Dataset
first_number = dset.numbers[0]
print(f"numbers: {numbers}\nfirst_number: {first_number}")

# Write Dataset on disk as HDF5 file
dset.write(file_path='./examples/dataset/numbers.hdf5')

# Read Dataset from HDF5 file
dset2 = dataset.Dataset()
dset2 = dset2.read(file_path='./examples/dataset/numbers.hdf5')
print(f"\ndset:\n{dset}\n\ndset2:\n{dset2}")

## Dataset attributes and methods

General attributes of a **Dataset** are described in the table below. Note that additional attributes can be added. These additional attributes are the **fields**, which are added to the **Dataset** with the **add_\< datatype \>()** methods mentioned above. How to use **add_\< datatype \>()** is described in section _Field datatypes_ for each field datatype.

| Attribute   | Description                                                           |
| :---------- | :-------------------------------------------------------------------- |
| default_field_suffix      | **TODO**                                                |
| fields      | Names of fields in the dataset                                        |
| meta        | Dictionary with meta information of data                              |
| num_obs     | Number of observations in dataset                                     |
| plot_fields | Names of fields in the dataset that would make sense to plot          |
| vars        | Dictionary with variables used for example to generate dataset file name  |
| version     | Dataset and Midgard version                                           |

The table below briefly describes the **Dataset** methods.

| Method      | Description                                                           |
| :---------- | :-------------------------------------------------------------------- |
| add_\< datatype \>     | Add a field of the given datatype to the Dataset. One such method exists for each field datatype, where \< datatype \> is a placeholder for the field datatype (e.g. float or time). |
| apply       | Apply a function to a field                                           |
| as_dataframe| Return a representation of the dataset as a Pandas DataFrame          |
| as_dict     | Return a representation of the dataset as a dictionary                |
| count       | Count the number of unique values in a field                          |  
| difference  | Compute the difference between two datasets: self - other             |
| extend      | Add observations from another dataset to the end of this dataset      |
| filter      | Filter observations                                                   |
| from_dict   | Convert a simple data dictionary to a dataset                         |
| mean        | Calculate mean of a field                                             |
| merge_with  | Merge in observations from other datasets. Note that this is quite strict in terms of which datasets can extend each other. They must have exactly the same tables. Tables containing several independent fields (e.g. float, text) may have different fields, in which case fields from both datasets are included. If a field is only defined in one dataset, it will get `empty` values for the observations from the dataset in which it is not defined. |
| num         | Number of observations satisfying the filters                         |
| plot_values | Return values of a field in a form that can be plotted                |
| read        | Read a dataset from file                                              |
| rms         | Calculate Root Mean Square of a field                                 |
| std         | Calculate the standard deviation of a field                           |
| subset      | Remove observations from all fields based on index                    |
| unique      | List unique values of a given field                                   |
| unit        | Unit for values in a given field (e.g. meter)                                     |
| unit_short  | Short description of unit for values in a given field (e.g. m)  |
| update_from | Transfers dataset fields from other Dataset to this Dataset. This will not create a copy of the data in the other Dataset |
| write       | Write a dataset to file                                               |
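As a rough illustration of what the statistics and counting methods in the table compute, the quantities can be written out in plain Python. This is a sketch under assumptions, not the Midgard code: the variable names are made up for the example, the standard deviation is taken as the population form (dividing by the number of observations), and `unique` is assumed to return sorted values.

```python
import math

# Example field values (illustrative data only)
numbers = [1.0, 2.0, 3.0, 4.0, 5.0]
satellite = ["E01", "E02", "E05", "G01", "E01"]

# Mean: sum of the values divided by the number of observations
mean = sum(numbers) / len(numbers)

# Root Mean Square: square root of the mean of the squared values
rms = math.sqrt(sum(x**2 for x in numbers) / len(numbers))

# Standard deviation (population form assumed): spread around the mean
std = math.sqrt(sum((x - mean) ** 2 for x in numbers) / len(numbers))

# Unique values of a field and the number of unique values
unique = sorted(set(satellite))
count = len(unique)

print(f"mean: {mean}, rms: {rms:.4f}, std: {std:.4f}")
print(f"unique: {unique}, count: {count}")
```

With the population form of the standard deviation, the three statistics are linked by rms² = mean² + std², which is a useful sanity check when comparing them.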

## Field datatypes

This section describes the field datatypes and how to add them to a Dataset.

### bool

A **bool** field consists of a NumPy array with boolean values.


In [None]:
# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=5)

# Add bool field to Dataset
dset.add_bool(name='index', val=[True, False, True, True, False])

# Access bool field 'index'
print(f"Field 'index': {dset.index}")

### float

A **float** field consists of a NumPy array with float values.

In [None]:
# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=5)

# Add float field to Dataset by defining unit of field elements
dset.add_float(name='numbers', val=[1,2,3,4,5], unit='meter')

# Access float field 'numbers'
print(f"Field 'numbers': {dset.numbers}")

# Print unit of float field 'numbers'
print(f"Unit of field 'numbers': {dset.unit('numbers')}")


### position

A **position** field consists of a PositionArray. **position** fields can be initialized with the following **systems**:

- trs: terrestrial reference system with unit [m, m, m]
- llh: latitude, longitude and height with unit [rad, rad, m]

The position field has to be initialized with the correct unit for the chosen **system**: **(meter, meter, meter)** for the **trs** system or **(radian, radian, meter)** for the **llh** system.

In [None]:
# Import NumPy package
import numpy as np

# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=2)

# Add position field to Dataset
val = np.array([[100, 100, 100], [200, 200, 200]])
dset.add_position(name='pos', val=val, system='trs')

# Access position field 'pos'
print(f"Field 'pos' in 'trs' system: {dset.pos}")

# Print unit of position field 'pos'
print(f"Unit of field 'pos' in 'trs' system: {dset.unit('pos')}")
      
# Dataset field can be converted from terrestrial reference system to
# latitude, longitude and height
print(f"\nField 'pos' in 'llh' system: {dset.pos.llh}")
print(f"Unit of field 'pos' in 'llh' system: {dset.unit('pos.llh')}")

NOTE: More about the functionality of **Position** class objects can be found in the Jupyter notebook **./midgard/documents/notebooks/position.ipynb**.
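To make the relation between the **trs** and **llh** systems more concrete, the conversion can be sketched with the standard library only. This is a minimal sketch under assumptions: the function names `llh_to_trs` and `trs_to_llh` are hypothetical, the WGS84 ellipsoid is assumed (Midgard may use a different ellipsoid), and the simple iteration is not valid exactly at the poles. It is not the Midgard implementation.

```python
import math

# WGS84 ellipsoid constants (an assumption for this sketch)
A = 6378137.0          # semi-major axis [m]
F = 1 / 298.257223563  # flattening
E2 = F * (2 - F)       # first eccentricity squared


def llh_to_trs(lat, lon, h):
    """Convert latitude [rad], longitude [rad], height [m] to x, y, z [m]."""
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)  # prime vertical radius
    x = (n + h) * math.cos(lat) * math.cos(lon)
    y = (n + h) * math.cos(lat) * math.sin(lon)
    z = (n * (1 - E2) + h) * math.sin(lat)
    return x, y, z


def trs_to_llh(x, y, z, iterations=10):
    """Convert x, y, z [m] to latitude [rad], longitude [rad], height [m]."""
    lon = math.atan2(y, x)
    p = math.hypot(x, y)               # distance from the rotation axis
    lat = math.atan2(z, p * (1 - E2))  # initial guess assuming zero height
    for _ in range(iterations):
        n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
        h = p / math.cos(lat) - n
        lat = math.atan2(z, p * (1 - E2 * n / (n + h)))
    n = A / math.sqrt(1 - E2 * math.sin(lat) ** 2)
    h = p / math.cos(lat) - n
    return lat, lon, h


# Round trip for a point at 60 deg latitude, 10 deg longitude, 100 m height
lat0, lon0, h0 = math.radians(60.0), math.radians(10.0), 100.0
x, y, z = llh_to_trs(lat0, lon0, h0)
lat, lon, h = trs_to_llh(x, y, z)
print(f"trs: ({x:.3f}, {y:.3f}, {z:.3f}) m")
print(f"llh: ({lat:.9f} rad, {lon:.9f} rad, {h:.3f} m)")
```

The units match the ones listed above: radians for latitude and longitude, meters for height and for the Cartesian coordinates.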

### posvel

A **posvel** field consists of a PosVelArray. **posvel** fields can be initialized with the following **systems**:

- kepler: Kepler elements (a, e, i, Omega, omega, E) with unit [m, -, rad, rad, rad, rad]
- trs: terrestrial reference system (x, y, z, vx, vy, vz) with unit [m, m, m, m/s, m/s, m/s]

The posvel field has to be initialized with the correct unit for the chosen **system**: **(m, -, rad, rad, rad, rad)** for the **kepler** system and **(m, m, m, m/s, m/s, m/s)** for the **trs** system.

In [None]:
# Import NumPy package
import numpy as np

# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=2)

# Add posvel field to Dataset (x, y, z in meter and vx, vy, vz in meter/second)
val = np.array([
    [15095082.616, -16985925.155, 18975783.780, 1814.893, -587.648, -1968.334],
    [13831647.196,  24089264.569, 10227973.970, -705.078, -745.436,  2708.633],
])
dset.add_posvel(name='posvel', val=val, system='trs')

# Access posvel field 'posvel'
print(f"Field 'posvel' in 'trs' system: {dset.posvel}")

# Print unit of posvel field 'posvel'
print(f"Unit of field 'posvel' in 'trs' system: {dset.unit('posvel')}")
      
# Dataset field can be converted from terrestrial reference system to
# Kepler elements
print(f"\nField 'posvel' in 'kepler' system: {dset.posvel.kepler}")
print(f"Unit of field 'posvel' in 'kepler' system: {dset.unit('posvel.kepler')}")

NOTE: More about the functionality of **PosVel** class objects can be found in the Jupyter notebook **./midgard/documents/notebooks/position.ipynb**.

### text

A **text** field consists of a NumPy array and can be initialized as follows:

In [None]:
# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=5)

# Add text field to Dataset
dset.add_text(
    name="satellite",
    val=["E01", "E02", "E05", "G01", "G02"],
)

# Access text field
print(f"Field 'satellite': {dset.satellite}")

### time

A **time** field consists of an array of Time objects. The time **scale** and **format** have to be specified when initializing **time** fields in a dataset. **time** fields can be initialized with different time **scales** and **formats**. The available time scales and formats, and more about the **Time** class, are described in the Jupyter notebook **./midgard/documents/notebooks/time.ipynb**.

In [None]:
# Standard library import
from datetime import datetime

# Import Dataset package
from midgard.data import dataset

# Get Dataset instance
dset = dataset.Dataset(num_obs=2)

# Add time field to Dataset by defining time scale and format
dset.add_time(
    name="time",
    val=[datetime(2019, 2, 1), datetime(2019, 2, 2)],
    scale="utc",
    fmt="datetime",
)

# Access time field
print(f"Field 'time': {dset.time}")

# Print time scale and format of time field
print(f"Format: {dset.time.fmt}")
print(f"Scale: {dset.time.scale}")
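The difference between a time **scale** and a time **format** can be illustrated with the standard library alone. The sketch below is an assumption-laden illustration, not the Midgard **Time** class: the helper names `to_mjd` and `to_isot` are hypothetical, and the GPS minus UTC offset of 18 seconds is hard-coded, which is valid for the 2019 epochs used here but not for dates before 2017-01-01.

```python
from datetime import datetime, timedelta

# Modified Julian Date counts days since 1858-11-17 00:00
MJD_EPOCH = datetime(1858, 11, 17)

# GPS time is ahead of UTC by 18 seconds since 2017-01-01 (hard-coded assumption)
GPS_MINUS_UTC = timedelta(seconds=18)


def to_mjd(epoch):
    """Format: represent an epoch as Modified Julian Date (days as float)."""
    return (epoch - MJD_EPOCH) / timedelta(days=1)


def to_isot(epoch):
    """Format: represent an epoch as an ISO 8601 string."""
    return epoch.isoformat()


# One epoch given in the UTC scale
epoch_utc = datetime(2019, 2, 1)

# Changing the format keeps the instant and the scale, only the representation changes
mjd_utc = to_mjd(epoch_utc)
isot_utc = to_isot(epoch_utc)

# Changing the scale shifts the clock reading for the same instant
epoch_gps = epoch_utc + GPS_MINUS_UTC
mjd_gps = to_mjd(epoch_gps)

print(f"UTC as datetime: {epoch_utc}, as isot: {isot_utc}, as mjd: {mjd_utc}")
print(f"GPS as datetime: {epoch_gps}, as mjd: {mjd_gps:.9f}")
```

In short, the format determines how an epoch is written down, while the scale determines which clock the epoch refers to.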