# 03 Create Dataset

In the previous two examples we saw how to import, visualize, and process the data. This notebook shows how to create a dataset for training.

## Table of Contents

* [General Idea](#idea)
* [Load Dependencies](#dependencies)
* [Load Collection to Transform](#events)
* [Create Multi Histogram Dataset](#create)
* [Dataset investigation](#investigate)


## General idea <a class="anchor" href="idea"></a>

We want to generate an efficient dataset for later training. To be able to try different noise levels, we store the pure (noise-free) histograms and add the noise on the fly later. Each generated dataset consists of three parts:

1. **Index file in `feather` format:** Contains the event data and, for each entry, the path of the corresponding histogram file relative to the dataset directory
2. **Configuration file:** JSON file describing the detector, among other settings
3. **Histogram files:** Files containing hits as histograms depending on the detector

All three parts are demonstrated in the section on [Dataset investigation](#investigate).
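To illustrate why storing noise-free histograms is convenient, here is a minimal sketch of adding noise on the fly. It is not the apollo implementation: the helper `add_noise` is hypothetical, and the Poisson noise model with a fixed expected number of noise hits per bin is an assumption made for this example.

```python
import numpy as np


def add_noise(histogram, noise_rate, rng):
    """Return a noisy copy of a clean histogram (hypothetical helper).

    Assumption: the number of noise hits in each bin is Poisson
    distributed with mean `noise_rate` (expected noise hits per bin).
    """
    noise = rng.poisson(lam=noise_rate, size=histogram.shape)
    return histogram + noise


rng = np.random.default_rng(seed=0)

# Clean histogram: 3 modules x 5 time bins, containing a single hit.
clean = np.zeros((3, 5))
clean[1, 2] = 1.0

# The same stored histogram can be reused with different noise levels.
for rate in (0.0, 0.5, 2.0):
    noisy = add_noise(clean, rate, rng)
    print(f"noise rate {rate}: total hits = {noisy.sum()}")
```

Because the stored histogram is never modified, a different noise level only requires a different `noise_rate` at load time, not a regenerated dataset.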

## Load Dependencies <a class="anchor" href="dependencies"></a>

In [39]:
%load_ext autoreload
%autoreload 2

import sys

sys.path.append("../")
sys.path.append("../../olympus")

import json
import os

import pandas as pd
import numpy as np
import shutil

from apollo.data.importers import EventCollectionImporter
from apollo.utils.detector_helpers import get_line_detector
from apollo.dataset.generators import MultiHistogramGenerator
from apollo.data.configs import Interval, HistogramConfig, HistogramDatasetConfig
from apollo.data.events import EventTimeframeMode
from apollo.visualization.events import plot_histogram, plot_timeline



## Load events to transform <a class="anchor" href="events"></a>

In [10]:
detector = get_line_detector()
event_collection = EventCollectionImporter.from_pickle(
    "../../data/all/events_track_0.pickle", detector=detector
)

## Create Multi Histogram Dataset <a class="anchor" href="create"></a>

Now we are ready to create the dataset that we have been waiting for.

In [42]:
interval = Interval(0, 1000)
histogram_config = HistogramConfig(start=interval.start, end=interval.end, bin_size=10)

# make sure not everything starts at 0
event_collection.redistribute(
    interval, is_in_timeframe_mode=EventTimeframeMode.CONTAINS_HIT
)


dataset = MultiHistogramGenerator(
    event_collection=event_collection, histogram_config=histogram_config
)

save_path = "../data/processed/notebooks"

if os.path.exists(save_path):
    shutil.rmtree(save_path)

dataset.generate(save_path)
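To make the binning concrete, the following sketch shows how hit times could be turned into a histogram with the same interval and bin size as above, using plain NumPy. It does not reproduce what `MultiHistogramGenerator` does internally; the hit times, the number of modules, and the layout (one row per module, one column per time bin) are assumptions for illustration.

```python
import numpy as np

# Values mirroring the cell above: interval from 0 to 1000, bin_size=10.
start, end, bin_size = 0, 1000, 10
bin_edges = np.arange(start, end + bin_size, bin_size)  # 101 edges -> 100 bins

# Hypothetical hit times for a detector with 3 modules.
hit_times_per_module = [
    np.array([12.0, 15.5, 430.0]),  # module 0
    np.array([]),                   # module 1: no hits
    np.array([999.0]),              # module 2
]

# Assumed layout: one row per module, one column per time bin.
histogram = np.stack(
    [np.histogram(times, bins=bin_edges)[0] for times in hit_times_per_module]
).astype(float)

print(histogram.shape)
print("hits:", histogram.sum())
```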



## Dataset investigation <a class="anchor" href="investigate"></a>

Let's see what we just created. First, the configuration:

In [23]:
with open(os.path.join(save_path, "config.json"), "rb") as config_file:
    dictionary = json.load(config_file)

HistogramDatasetConfig.from_json(dictionary)

HistogramDatasetConfig(path='../data/processed/test', detector=Detector(modules=['Module [0, 0], Point (x: 0.0, y: 0.0, z: -500.0) [m], 8.861518238737425e-05 [Hz], 0.12755102040816327', 'Module [0, 1], Point (x: 0.0, y: 0.0, z: -447.36842105263156) [m], 6.764606613159678e-05 [Hz], 0.12755102040816327', 'Module [0, 2], Point (x: 0.0, y: 0.0, z: -394.7368421052632) [m], 9.214486243223132e-05 [Hz], 0.12755102040816327', 'Module [0, 3], Point (x: 0.0, y: 0.0, z: -342.10526315789474) [m], 0.00013838750643446082 [Hz], 0.12755102040816327', 'Module [0, 4], Point (x: 0.0, y: 0.0, z: -289.47368421052636) [m], 0.00010194159159632369 [Hz], 0.12755102040816327', 'Module [0, 5], Point (x: 0.0, y: 0.0, z: -236.84210526315792) [m], 0.0001569002364322399 [Hz], 0.12755102040816327', 'Module [0, 6], Point (x: 0.0, y: 0.0, z: -184.21052631578948) [m], 8.85974593183028e-05 [Hz], 0.12755102040816327', 'Module [0, 7], Point (x: 0.0, y: 0.0, z: -131.5789473684211) [m], 0.0006373218102404027 [Hz], 0.127551020

Next up, we check the created index file.

In [27]:
index_data = pd.read_feather(os.path.join(save_path, "index.h5"))

index_data.head()

| | events | file |
|---|---|---|
| 0 | [] | data/histogram_95103ef2-5cdc-423f-a366-d4a333f... |
| 1 | [{'default_value': 0.0, 'direction': {'x': -0.... | data/histogram_650df277-f2c4-41a5-af65-0389f23... |
| 2 | [] | data/histogram_174f75f8-84a6-4dc5-80b3-748adab... |
| 3 | [] | data/histogram_019a8d91-b358-49c6-a45d-0bf93fd... |
| 4 | [] | data/histogram_8f726e4c-37a2-4fc7-b468-e3830aa... |


Last but not least, let's check one histogram:

In [37]:
file_path = os.path.join(save_path, index_data.iloc[1]["file"])

histogram = np.load(file_path)
print(histogram)
print("hits: ", np.sum(histogram))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
hits:  3.0
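Since the printed array is mostly zeros, it can help to list only the non-empty bins. The sketch below uses a synthetic stand-in for the loaded histogram; its shape, the hit positions, and the interpretation of rows as modules and columns as time bins are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for a loaded histogram (assumed: rows = modules,
# columns = time bins). In the notebook, use the array from np.load instead.
histogram = np.zeros((20, 100))
histogram[3, 41] = 1.0
histogram[4, 42] = 2.0

# Indices of all bins that contain at least one hit.
module_idx, bin_idx = np.nonzero(histogram)
for m, b in zip(module_idx, bin_idx):
    print(f"module {m}, bin {b}: {histogram[m, b]:.0f} hit(s)")

# Aggregate views: total hits per module and per time bin.
hits_per_module = histogram.sum(axis=1)
hits_per_bin = histogram.sum(axis=0)
```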


What a fantastic success! We now have a way to create a dataset and save it to disk.