# H5Database

This sub-package provides three concepts:
- ``H5repo``: Uses external HDF5 links. An HDF5 file serves as a "table of contents" that links to files in a regular file system.
- ``H5Files``: Opens multiple files at the same time and lets you browse through them.
- ``h5mongo``: Uses pymongo (MongoDB) to mirror the metadata of HDF files.

In [None]:
from h5rdmtoolbox import h5database as h5db
from h5rdmtoolbox import generate_temporary_directory
from h5rdmtoolbox.h5database import tutorial

In [None]:
tocdir = generate_temporary_directory('test_repo')
tutorial.build_test_repo(tocdir)

## H5Repo - External link based repository

Initialize an `H5repo` object and specify the root directory under which the HDF files are stored:

In [None]:
repo = h5db.H5repo(tocdir)
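As a rough illustration of what "files placed under a root directory" means, the sketch below collects candidate HDF files with a recursive directory scan. It is not the toolbox's implementation: the helper `find_hdf_files` and the list of accepted suffixes are assumptions made for this example, and empty placeholder files stand in for real HDF5 files.

```python
import tempfile
from pathlib import Path


def find_hdf_files(root, suffixes=('.hdf', '.h5', '.hdf5')):
    """Return all files below `root` whose suffix is in `suffixes`, sorted by path.

    Hypothetical helper; the accepted suffixes are an assumption.
    """
    root = Path(root)
    return sorted(p for p in root.rglob('*')
                  if p.is_file() and p.suffix.lower() in suffixes)


# Empty placeholder files are enough to demonstrate the scan
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'run1').mkdir()
    (Path(tmp) / 'run1' / 'a.hdf').touch()
    (Path(tmp) / 'b.h5').touch()
    (Path(tmp) / 'notes.txt').touch()  # ignored: not an HDF suffix
    found = [p.relative_to(tmp).as_posix() for p in find_hdf_files(tmp)]

print(found)
```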

The object creates a `toc` file (toc = table of contents), which is an HDF5 file with external links to the HDF files that were found:

In [None]:
repo.toc_filename.name

The content can be displayed as a pandas table:

In [None]:
repo.dump(full_path=False)  # minimizes the output (no full folder path is shown)

Entries can be accessed by index, which displays the content of the corresponding file:

In [None]:
repo[0]

### Filtering

The repository can be filtered in a HDF5-like syntax. First import all filter classes from the module `filter_classes`:

In [None]:
from h5rdmtoolbox.h5database.filter_classes import *
# repo.list_attribute_values('operator', '/')

The `filter` method requires an `Entry` object. It describes the access location within a file, here the group "operation_point" in the root group. In this example the repository is filtered for files in which the attribute "long_name" of that group equals "Operation point data group". A sub-repository is returned, which is again an HDF5 file with external links, but this time only to the HDF files matching the filter request:

In [None]:
%%time
sub_repo = repo.filter(Entry['/operation_point'].attrs['long_name'] == 'Operation point data group')
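The expression passed to `filter` looks like a comparison but has to produce a filter description rather than a boolean. One way such a syntax can be built is by overloading `__eq__` so that it returns a deferred condition. The sketch below illustrates only that idea; it is not the toolbox's code. The names `EntrySketch` and `AttrEquals` are hypothetical, and plain dictionaries stand in for HDF5 files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttrEquals:
    """Deferred condition: attribute `name` at `path` equals `value`."""
    path: str
    name: str
    value: object

    def matches(self, file_like):
        # file_like maps an HDF5-style path to a dict of attributes
        attrs = file_like.get(self.path)
        return attrs is not None and attrs.get(self.name) == self.value


class _AttrRef:
    """Reference to one attribute; comparing it yields a condition object."""

    def __init__(self, path, name):
        self.path, self.name = path, name

    def __eq__(self, value):
        # Return a filter description instead of True/False
        return AttrEquals(self.path, self.name, value)


class _AttrsProxy:
    def __init__(self, path):
        self.path = path

    def __getitem__(self, name):
        return _AttrRef(self.path, name)


class EntrySketch:
    """Hypothetical stand-in that supports EntrySketch['/path'].attrs['name']."""

    def __init__(self, path):
        self.attrs = _AttrsProxy(path)

    def __class_getitem__(cls, path):
        return cls(path)


# Dictionaries standing in for two files
files = {
    'a.hdf': {'/operation_point': {'long_name': 'Operation point data group'}},
    'b.hdf': {'/': {'title': 'no operation point'}},
}
condition = EntrySketch['/operation_point'].attrs['long_name'] == 'Operation point data group'
matching = [name for name, content in files.items() if condition.matches(content)]
print(matching)
```

Because the comparison yields an object, the repository can apply it to every file later instead of evaluating it once at the call site.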

In [None]:
sub_repo.dump(False)

The elapsed time for the filter request and for building the new HDF toc file is:

In [None]:
sub_repo.elapsed_time  # [s]

Evaluating the sub-repository is straightforward because we are still working with HDF5 files. Let's plot data from the filter results:

In [None]:
%%time
import matplotlib.pyplot as plt

plt.figure()
for r in sub_repo:
    with r as h5:
        if 'operation_point' in h5:
            plt.scatter(h5['operation_point']['vfr'].attrs['mean'], h5['operation_point']['ptot'].attrs['mean'])
plt.xlabel('vfr')
plt.ylabel('ptot')
plt.show()

## H5Files - Accessing multiple HDF files

This concept assumes that the HDF files are already known, for example as the result of the filter request above:

In [None]:
from h5rdmtoolbox.h5database import H5Files

In [None]:
sub_repo[0:3]

In [None]:
with H5Files(*[sr.filename for sr in sub_repo[0:4]]) as h5files:
    print(h5files.keys())
    h5files[list(h5files.keys())[0]].dump()
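Opening several files in one `with` block and closing all of them on exit is a pattern that `contextlib.ExitStack` handles well. The sketch below shows the pattern only; it is not how `H5Files` is implemented. The class `MultiFile` and its `opener` parameter are hypothetical, and plain text files stand in for HDF5 files so that the example needs nothing beyond the standard library.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path


class MultiFile:
    """Open several files together and close them all on exit (hypothetical helper)."""

    def __init__(self, *filenames, opener=open):
        self._filenames = [Path(f) for f in filenames]
        self._opener = opener
        self._stack = ExitStack()

    def __enter__(self):
        handles = {}
        try:
            for fname in self._filenames:
                # enter_context registers the handle so it is closed on exit
                handles[fname.stem] = self._stack.enter_context(self._opener(fname))
        except Exception:
            # Close whatever was opened before the failure
            self._stack.close()
            raise
        return handles

    def __exit__(self, *exc_info):
        return self._stack.__exit__(*exc_info)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name in ('first', 'second'):
        path = Path(tmp) / f'{name}.txt'
        path.write_text(name)
        paths.append(path)

    with MultiFile(*paths) as files:
        keys = list(files.keys())
        contents = {key: handle.read() for key, handle in files.items()}
    # After the block, every handle has been closed
    all_closed = all(handle.closed for handle in files.values())

print(keys, all_closed)
```

Since `h5py.File` objects are context managers as well, passing `opener=h5py.File` should work in the same way, assuming `h5py` is installed.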

## HDF and PyMongo

Last but not least, `h5database` provides a "real" database solution using `pymongo`. Here, not the files but their metadata are written to so-called `collections`:

In [None]:
import pymongo
from pymongo import MongoClient

In [None]:
client = MongoClient()
client

In [None]:
db = client['h5database_notebook_tutorial']
collection = db['test']
collection.drop()  # remove any entries left over from a previous run

Import the `mongo` module, which adds the accessor `mongo` to datasets and groups:

In [None]:
from h5rdmtoolbox.h5database import mongo

In [None]:
import h5rdmtoolbox as h5tbx

for fname in repo.filenames:
    with h5tbx.H5File(fname) as h5:
        h5.mongo.insert(collection=collection, recursive=True)

Let's inspect the found database entries:

In [None]:
from pprint import pprint
pprint(collection.find_one())  # show one stored document as an example

Let's run the same filter request as before (`sub_repo = repo.filter(Entry['/operation_point'].attrs['long_name'] == 'Operation point data group')`), this time as a MongoDB query:

In [None]:
%%time
res = collection.find({'long_name': 'Operation point data group'})
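To picture what the equality query does, the sketch below applies the same kind of query to plain dictionaries. The document layout is an assumption: only the keys `filename` and `long_name` are suggested by this tutorial's code, while the key `name` and all values are made up for illustration. The helper `matches` covers plain scalar equality only; real MongoDB matching has further rules, for example for arrays and missing fields.

```python
def matches(document, query):
    """True if every key in `query` is present in `document` with an equal value."""
    return all(key in document and document[key] == value
               for key, value in query.items())


# Hypothetical documents, one per group or dataset
documents = [
    {'filename': 'run1.hdf', 'name': '/operation_point',
     'long_name': 'Operation point data group'},
    {'filename': 'run1.hdf', 'name': '/operation_point/vfr',
     'long_name': 'placeholder dataset'},
    {'filename': 'run2.hdf', 'name': '/operation_point',
     'long_name': 'Operation point data group'},
]

query = {'long_name': 'Operation point data group'}
hits = [doc['filename'] for doc in documents if matches(doc, query)]
print(hits)
```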

The number of matches is the same as before:

In [None]:
len(sub_repo), collection.count_documents({'long_name': 'Operation point data group'})

Let's generate the equivalent plot as before:

In [None]:
%%time
plt.figure()

for r in res.rewind():
    with h5tbx.H5File(r['filename']) as h5:
        if 'operation_point' in h5:
            plt.scatter(h5['operation_point']['vfr'].attrs['mean'], h5['operation_point']['ptot'].attrs['mean'])
plt.xlabel('vfr')
plt.ylabel('ptot')
plt.show()