# 00 Data Format

This notebook contains all the information regarding the file specification.

## Folder and file structure

A created dataset should be contained in one folder. This folder should contain two files:

* `data.h5`: contains all the data
* `configuration.json`: contains the configuration of the generated dataset
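
The layout above can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the package: the helper name `missing_dataset_files` is hypothetical, and only the two file names come from the list above.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
REQUIRED_FILES = ("data.h5", "configuration.json")


def missing_dataset_files(folder: Path) -> List[str]:
    """Return the required file names that are absent from `folder`."""
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty return value means the folder has the expected layout; anything else names the files that still need to be created.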

Let's explore further how the files should be set up.

## `configuration.json`: How we store our configuration

The configuration file shows at a glance how a dataset has been configured. Its exact contents can vary depending on the generating package, but this package tries to provide a common overview. Generally speaking, the JSON is structured like this:

```json
{
  "detector": ...,
  "generation": ...,
  "status": ...
}
```
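
As a rough illustration, the top-level structure could be checked as follows. This sketch assumes that all three keys are expected to be present; the helper name `missing_config_keys` and the empty placeholder values are hypothetical, since the contents of each section are not specified yet.

```python
import json

# Top-level keys taken from the JSON outline above.
EXPECTED_KEYS = {"detector", "generation", "status"}


def missing_config_keys(config_text: str) -> set:
    """Parse a configuration JSON string and return the missing top-level keys."""
    config = json.loads(config_text)
    return EXPECTED_KEYS - set(config)


# Placeholder values only; the real contents of each section are not defined here.
example = '{"detector": {}, "generation": {}, "status": {}}'
print(missing_config_keys(example))  # prints set()
```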

### Detector

tbd.

### Generation

tbd.

### Status

tbd.

## `data.h5`: How we store the data

The data is stored in the [HDF](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format, which supports both datasets and groups. We use both features in our file structure for efficiency. Specifically, we use the following paths:

* `detector`: Contains all the PMT position and configuration data
* `records`: Contains all record information
* `sources`: Optional; contains the sources of all records that have sources
  * `sources/{record_id}`: Sources should be stored by `record_id`
* `hits`: Hits at the PMTs of the detector
  * `hits/{record_id}`: Hits should be stored by `record_id`
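
The per-record paths follow a simple pattern, which can be sketched with a small helper. The function name `record_path` is hypothetical and not part of the package; only the group names and the `{group}/{record_id}` pattern come from the list above.

```python
# Groups that are substructured by record id, as listed above.
PER_RECORD_GROUPS = ("sources", "hits")


def record_path(group: str, record_id: int) -> str:
    """Build the HDF path under which the data of one record is stored."""
    if group not in PER_RECORD_GROUPS:
        raise ValueError(f"'{group}' is not stored per record")
    return f"{group}/{record_id}"


print(record_path("hits", 42))  # prints hits/42
```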

Some general information:

* `record_id`: Pseudo-unique integer ID of a record (currently implemented as `uuid.uuid1().int >> 64` using the [uuid](https://docs.python.org/3/library/uuid.html) module)
* `datasets`: Should be stored in an appendable format, not the fixed format.
* `schema validation`: Ananke is the single source of truth for the schema. Other packages should adapt their output so that it can be imported without error (see documentation).
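
The `record_id` expression quoted above can be tried out on its own. The wrapper name `new_record_id` is hypothetical; the expression inside it is the one given in the list.

```python
import uuid


def new_record_id() -> int:
    """Return the upper 64 bits of a time-based UUID as an integer."""
    # uuid1() yields a 128-bit value; shifting right by 64 keeps the upper half.
    return uuid.uuid1().int >> 64


record_id = new_record_id()
print(0 <= record_id < 2**64)  # prints True
```

Note that the result spans the full unsigned 64-bit range, so it can exceed `2**63 - 1`, the largest value a signed 64-bit integer can hold.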

Let's look specifically at how the different data frames should be built. To do that, we need to import a couple of things:

```python
from enum import Enum
from typing import Type

import pandera as pa
from ananke.schemas.detector import DetectorSchema
from ananke.schemas.event import (
    EventRecordSchema,
    HitSchema,
    NoiseRecordSchema,
    RecordSchema,
    RecordType,
    SourceRecordSchema,
    SourceType,
)


def pretty_print_schema(schema_model: Type[pa.SchemaModel], indent: int = 0) -> None:
    """Print each column of a schema model as `name (dtype)`.

    Args:
        schema_model: Schema model to pretty print
        indent: Indent in tabs
    """
    # `to_schema().columns` maps each column name to its column definition.
    for key, value in schema_model.to_schema().columns.items():
        print("\t" * indent + f"{key} ({value.dtype})")


def pretty_print_enum(enum_to_print: Type[Enum]) -> None:
    """Print the value of every member of an enum, one per line.

    Args:
        enum_to_print: Enum class whose values should be printed
    """
    for entry in enum_to_print:
        print(entry.value)
```

### Detector Data Frame

The detector data frame contains all the PMT information in one table. The table complies with the following schema:

```python
pretty_print_schema(DetectorSchema)
```

```text
pmt_id (int64)
pmt_efficiency (float64)
pmt_area (float64)
pmt_noise_rate (float64)
pmt_location_x (float64)
pmt_location_y (float64)
pmt_location_z (float64)
pmt_orientation_x (float64)
pmt_orientation_y (float64)
pmt_orientation_z (float64)
module_id (int64)
module_radius (float64)
module_location_x (float64)
module_location_y (float64)
module_location_z (float64)
string_id (int64)
string_location_x (float64)
string_location_y (float64)
string_location_z (float64)
```


### Records Data Frame

The records data frame contains all information about the generated records. Whether a record is noise or an event, it should be saved in this format. The basic format is:

```python
pretty_print_schema(RecordSchema)
```

```text
time (float64)
duration (float64)
record_id (int64)
type (str)
```


Depending on the record type, the schema can be extended. For events, the full record schema is:

```python
pretty_print_schema(EventRecordSchema)
```

```text
time (float64)
duration (float64)
record_id (int64)
type (str)
location_x (float64)
location_y (float64)
location_z (float64)
orientation_x (float64)
orientation_y (float64)
orientation_z (float64)
energy (float64)
particle_id (int64)
length (float64)
```
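
One way to read the two listings is that an event record carries every base record column plus a set of event-specific ones. The sketch below illustrates this with plain Python lists. The column names are copied from the listings, while the helper `missing_columns` and the example values are hypothetical; actual validation is done by the Ananke schemas.

```python
# Column names copied from the schema listings above.
RECORD_COLUMNS = ["time", "duration", "record_id", "type"]
EVENT_RECORD_COLUMNS = RECORD_COLUMNS + [
    "location_x", "location_y", "location_z",
    "orientation_x", "orientation_y", "orientation_z",
    "energy", "particle_id", "length",
]


def missing_columns(row: dict, required: list) -> list:
    """Return the required column names that `row` does not provide."""
    return [column for column in required if column not in row]


# Made-up values, for illustration only.
base_row = {"time": 0.0, "duration": 1000.0, "record_id": 1, "type": "cascade"}

print(missing_columns(base_row, RECORD_COLUMNS))  # prints []
print(len(missing_columns(base_row, EVENT_RECORD_COLUMNS)))  # prints 9
```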


And for noise records:

```python
pretty_print_schema(NoiseRecordSchema)
```

```text
time (float64)
duration (float64)
record_id (int64)
type (str)
```


Currently, the following types are "officially" recognized:

```python
pretty_print_enum(RecordType)
```

```text
starting_track
cascade
realistic_track
electrical
```
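
For readers without the package at hand, the printed values can be mirrored in a stand-in enum. This is only a sketch: the class name `RecordTypeSketch`, its member names, and the helper `is_known_record_type` are hypothetical, and the authoritative definition remains `RecordType` in `ananke.schemas.event`. Only the string values come from the output above.

```python
from enum import Enum


class RecordTypeSketch(Enum):
    """Illustrative stand-in; member names are assumptions, values are from the output."""

    STARTING_TRACK = "starting_track"
    CASCADE = "cascade"
    REALISTIC_TRACK = "realistic_track"
    ELECTRICAL = "electrical"


def is_known_record_type(value: str) -> bool:
    """Check whether a `type` string matches one of the listed values."""
    return value in {member.value for member in RecordTypeSketch}


print(is_known_record_type("cascade"))  # prints True
print(is_known_record_type("unknown"))  # prints False
```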


### Sources Group

The sources group contains multiple data frames, each stored below the path of its `record_id`. Thus, the sources of a record can be accessed by the path `sources/{record_id}`. The schema of a source is:

```python
pretty_print_schema(SourceRecordSchema)
```

```text
time (float64)
duration (float64)
record_id (int64)
type (str)
location_x (float64)
location_y (float64)
location_z (float64)
orientation_x (float64)
orientation_y (float64)
orientation_z (float64)
number_of_photons (int64)
```


Now let's look at the source types:

```python
pretty_print_enum(SourceType)
```

```text
cherenkov
isotropic
```


### Hits Group

Like the sources group, the hits group is substructured by `record_id`. Hence, the hits for one record can be accessed by `hits/{record_id}`. Let's look at the schema:

```python
pretty_print_schema(HitSchema)
```

```text
time (float64)
duration (float64)
record_id (int64)
type (str)
string_id (int64)
module_id (int64)
pmt_id (int64)
```


The types are the same as for the records.