# Serverless HDF Database

This section explains how specific **datasets or groups can be found** using the `h5rdmtoolbox`.

Think of the following two scenarios:

 - (a) We have **one HDF5 file** with many groups and attributes. We are looking for a specific dataset based on a specific attribute.
 - (b) We generated **multiple HDF5 files**. Now we need to select the files that contain certain metadata.

In both cases, we don't need to fill a database first; we can just **use the files as a database**.

In case (a), the attributes of every object are checked against the request, e.g. whether an attribute equals a certain value. Case (b) works the same way, except that the check is applied to multiple files. This is a brute-force approach, but in most cases it is still fast enough. Keep in mind that filling a database up front would also take time. Also note that HDF attributes can be read quickly because the raw data stays untouched even when the files are opened.
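The brute-force idea can be illustrated without any HDF5 file at all. The following is a minimal sketch, assuming that plain dictionaries stand in for files and their objects; the names `files` and `brute_force_find` are made up for this illustration and are not part of `h5rdmtoolbox`:

```python
# Hypothetical in-memory stand-in for HDF5 files: file name -> {object path: attributes}.
files = {
    'run1.hdf': {
        '/': {'project': 'tutorial'},
        '/velocity': {'units': 'm/s', 'standard_name': 'x_velocity'},
    },
    'run2.hdf': {
        '/': {'project': 'other'},
        '/pressure': {'units': 'Pa', 'standard_name': 'pressure'},
    },
}


def brute_force_find(files, attr_name, attr_value):
    """Return (file name, object path) pairs whose attribute equals attr_value."""
    hits = []
    for file_name, objects in files.items():   # case (b): loop over all files
        for path, attrs in objects.items():    # case (a): loop over all objects in a file
            if attrs.get(attr_name) == attr_value:
                hits.append((file_name, path))
    return hits


matches = brute_force_find(files, 'units', 'm/s')
print(matches)
```

Every object of every file is visited once, so the cost grows with the total number of objects, but no index has to be built or kept in sync.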

If you are already familiar with the syntax of `mongodb`, you will recognize the similarities; otherwise you will learn it now.

In [1]:
import h5rdmtoolbox as h5tbx

h5tbx.use(None)

using("h5py")

Let's build a file first:

In [2]:
with h5tbx.File() as h5:
    h5.write_iso_timestamp(name='timestamp', dt=None) # writes the current date time in iso format to the attribute
    h5.attrs['project'] = 'tutorial'
    h5.create_dataset('velocity', data=[1,2,-1], attrs=dict(units='m/s', standard_name='x_velocity'))
    g = h5.create_group('group1')
    g.create_dataset('velocity', data=[4,0,-3,12,3], attrs=dict(units='m/s', standard_name='x_velocity'))
    g = h5.create_group('group2')
    g.create_dataset('velocity', data=[12,11.3,4.6,7.3,8.1], attrs=dict(units='m/s', standard_name='x_velocity'))
    h5.dump()
    filename = h5.hdf_filename

## Search/Find within a file

The method `.find()` can be called from any group. It expects a dictionary. As mentioned, the syntax is very similar to that of `pymongo`. The filter request is built up as follows:

  `find({<keyword>: <value>}, <object_filter>)`
  
- The keyword is either
    - an attribute name or
    - a property of a dataset or group object, like "name", "shape", "dtype", ...
- To indicate that you are searching for a property, put a "$" in front of it.

- The object filter can be either "$dataset" or "$group". If it is not provided, results of both object types are returned.
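To make the role of the "$" prefix and of the object filter concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `Obj` record instead of real HDF5 objects and only handles exact matches; it is not how `h5rdmtoolbox` is implemented:

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical stand-in for an HDF5 dataset or group."""
    kind: str                      # 'dataset' or 'group'
    name: str                      # full path inside the file
    shape: tuple = ()
    attrs: dict = field(default_factory=dict)

    @property
    def basename(self):
        return self.name.rsplit('/', 1)[-1]


def matches(obj, query, object_filter=None):
    """Check one object against an exact-match query."""
    if object_filter is not None and object_filter != '$' + obj.kind:
        return False
    for key, expected in query.items():
        if key.startswith('$'):
            actual = getattr(obj, key[1:], None)   # "$shape" -> obj.shape
        else:
            actual = obj.attrs.get(key)            # plain key -> attribute lookup
        if actual != expected:
            return False
    return True


objects = [
    Obj('group', '/group1'),
    Obj('dataset', '/group1/velocity', (5,), {'units': 'm/s'}),
    Obj('dataset', '/velocity', (3,), {'units': 'm/s'}),
]
query = {'$basename': 'velocity', 'units': 'm/s'}
found = [o.name for o in objects if matches(o, query, '$dataset')]
print(found)
```

All conditions in the dictionary must hold at the same time, which mirrors the implicit "and" of a MongoDB-style query.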

In [3]:
with h5tbx.File(filename) as h5:
    results = h5.find({'$basename': 'velocity'}, '$dataset')
    print(results)

[<HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">, <HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">, <HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]


The above code has one **restriction**: We cannot use `results` anymore after the file has been closed:

In [4]:
results

[<Closed HDF5 dataset (convention "h5py")>,
 <Closed HDF5 dataset (convention "h5py")>,
 <Closed HDF5 dataset (convention "h5py")>]

To overcome this issue, we can use the file object from the subpackage `database`. The returned objects are called "lazy" datasets and groups, respectively. They are wrappers around the closed objects. Properties and attributes are stored in them, and if data is requested, the file is opened and closed in the background:

In [5]:
results = h5tbx.database.File(filename).find({'$basename': 'velocity'}, '$dataset')
results

[<LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]

The result objects are "lazy objects". We can access all attributes and properties even though the file is already closed. Only when slicing is the file actually opened (and closed again afterwards):

In [6]:
results[0]

<LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>

In [7]:
results[0][()]
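The lazy-object pattern itself is independent of HDF5. The sketch below is an illustration under stated assumptions: a JSON file stands in for the HDF5 file, and the class `LazyDataset` with its counter `n_reads` is hypothetical, not the `LDataset` class of the toolbox:

```python
import json
import os
import tempfile


class LazyDataset:
    """Hypothetical lazy wrapper: metadata lives in memory, data stays on disk."""

    def __init__(self, filename, name, attrs):
        self.filename = filename
        self.name = name
        self.attrs = dict(attrs)   # cached, readable without touching the file
        self.n_reads = 0           # counts how often the file was reopened

    def __getitem__(self, key):
        # Open the file only for the duration of the read, then close it again.
        with open(self.filename) as f:
            data = json.load(f)[self.name]
        self.n_reads += 1
        return data[key]


# A JSON file stands in for the HDF5 file in this sketch.
json_path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(json_path, 'w') as f:
    json.dump({'/velocity': [1, 2, -1]}, f)

lazy = LazyDataset(json_path, '/velocity', {'units': 'm/s'})
units = lazy.attrs['units']   # no file access needed
first_two = lazy[0:2]         # the file is opened, read, and closed here
print(units, first_two, lazy.n_reads)
```

The trade-off is that cached attributes can become stale if the file changes after the query, while the data itself is always read fresh.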

## Shortcuts

Sometimes the above query dictionary is a bit cumbersome. For example, finding any dataset with a certain attribute, no matter its value, amounts to asking "Give me every dataset that has this attribute".

To put this into a query, we need a regex formulation like so: `{'standard_name': {'$regex': '.*'}}`. The shortcut is to just pass the name of the attribute we are looking for:

In [8]:
h5tbx.database.File(filename).find_one('standard_name', '$dataset')

<LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>

In [9]:
h5tbx.database.File(filename).find_one(['standard_name', 'units'], '$dataset')

<LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>
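Conceptually, the shortcut is a rewrite of the argument into the explicit regex query shown above. A minimal sketch of such a rewrite follows; the helper name `expand_shortcut` is an assumption for illustration, and the toolbox may handle the shortcut differently internally:

```python
def expand_shortcut(query):
    """Turn a bare attribute name (or a list of names) into an explicit query dict."""
    if isinstance(query, str):
        query = [query]
    if isinstance(query, (list, tuple)):
        # '.*' matches any string, so only the presence of the attribute matters.
        return {name: {'$regex': '.*'} for name in query}
    return query   # already a dictionary: leave it untouched


single = expand_shortcut('standard_name')
multiple = expand_shortcut(['standard_name', 'units'])
print(single)
print(multiple)
```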

## Advanced search

Until now, we searched for **exact** matches of attribute or property values. Sometimes different operators are required. Examples:

 - \$regex: Match the attribute string with a pattern
 - \$gt: Find only objects where the attribute/property is greater than a given value
 - \$lt: ...
 - ...

In [10]:
results = h5tbx.database.File(filename).find({'timestamp': {'$regex': '.*'}}, '$group')
results[0].attrs['timestamp']

'2023-08-31T12:07:09.528075'
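One common way to support such operators is a lookup table that maps each operator keyword to a comparison function. The following sketch assumes a hypothetical table `OPERATORS` and helper `check`; it only illustrates the idea and does not reproduce the toolbox's own operator table:

```python
import operator as op
import re

# Hypothetical operator table: operator keyword -> function(actual, expected) -> bool.
OPERATORS = {
    '$eq': op.eq,
    '$gt': op.gt,
    '$gte': op.ge,
    '$lt': op.lt,
    '$lte': op.le,
    '$regex': lambda actual, pattern: (isinstance(actual, str)
                                       and re.search(pattern, actual) is not None),
}


def check(actual, condition):
    """Evaluate one query condition against an attribute or property value."""
    if actual is None:
        return False                   # a missing attribute never matches
    if not isinstance(condition, dict):
        return actual == condition     # a plain value means exact match
    return all(OPERATORS[key](actual, expected) for key, expected in condition.items())


print(check(5, {'$gt': 3, '$lt': 10}))
print(check('x_velocity', {'$regex': '^x_'}))
print(check('m/s', 'm/s'))
```

Because the operators live in a dictionary, adding a new one only requires registering another function under a new key.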

### User-defined operator
With the operators available so far, we cannot check whether the timestamp is later (or earlier) than a given timestamp, i.e. whether the data was written before or after a given point in time. For this, we need to write our own operator:

In [11]:
import re
from typing import Dict, Union
from datetime import datetime
from h5rdmtoolbox.database import file

def isodatetime_operator(value, flt: Union[datetime, Dict]) -> bool:
    """Compare an ISO-formatted attribute value with a datetime or a dict of comparisons."""
    if value is None:
        return False

    if isinstance(flt, dict):
        av_flt = ('$gt', '$gte', '$lt', '$lte', '$eq')
        for k in flt:
            if k not in av_flt:
                raise KeyError(f'Invalid filter operator: {k}, expected one of these: {av_flt}')
    elif isinstance(flt, str):
        raise TypeError(f'You must pass a datetime object, not {type(flt)}')
    else:
        # a plain datetime is interpreted as an equality check
        flt = {'$eq': flt}

    # only check attributes that are string or datetime
    if isinstance(value, str):
        pattern = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+'
        match = re.search(pattern, value)
        if match is None:
            return False
        if match.group() == '':
            return False
        value = datetime.fromisoformat(value)
    elif isinstance(value, datetime):
        pass
    else:
        return False

    # Now perform the actual datetime comparison:
    for k, v in flt.items():
        if not h5tbx.database.file.operator[k](value, v):
            return False
    return True

In [12]:
file.operator['$isodatetime'] = isodatetime_operator

In [13]:
from datetime import datetime

# ToDo Files(filename.parent).find(---) does not work!
results = h5tbx.database.File(filename).find({'$name': '/',
                                              'timestamp': {'$isodatetime': {'$lt': datetime.now(),
                                                                             '$gt': datetime(2020, 6, 4)}}},
                                             '$group')
results

[<LGroup "/" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">]
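The core of such an operator is parsing the ISO string and comparing `datetime` objects rather than strings. Isolated from the toolbox, this could look like the following sketch; the function `in_time_range` and its parameters `after` and `before` are assumptions made for this illustration:

```python
from datetime import datetime


def in_time_range(value, after=None, before=None):
    """Return True if the ISO 8601 string `value` lies strictly between the bounds."""
    try:
        timestamp = datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return False   # not a string, or not a parsable ISO timestamp
    if after is not None and not timestamp > after:
        return False
    if before is not None and not timestamp < before:
        return False
    return True


stamp = '2023-08-31T12:07:09.528075'
print(in_time_range(stamp, after=datetime(2020, 6, 4), before=datetime(2024, 1, 1)))
print(in_time_range(stamp, after=datetime(2024, 1, 1)))
print(in_time_range('tutorial'))
```

Returning `False` for unparsable values, instead of raising, keeps a search from failing on the many attributes that are not timestamps.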

## Find within one or multiple folders
To search within a folder, use `h5tbx.database.Folder`. Pass `rec=False` if a recursive search for files is not wanted:

In [14]:
h5tbx.database.Folder(filename.parent, rec=True).find_one({'$basename': 'velocity'}, '$dataset').name

'/group2/velocity'
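The difference between a recursive and a non-recursive file search can be sketched with `pathlib` alone. The helper `collect_hdf_files` and the file names below are hypothetical; how the toolbox itself gathers the files may differ:

```python
import pathlib
import tempfile

# Build a small throwaway folder tree; the file names are made up for this sketch.
root = pathlib.Path(tempfile.mkdtemp())
(root / 'sub').mkdir()
(root / 'a.hdf').touch()
(root / 'sub' / 'b.hdf').touch()
(root / 'notes.txt').touch()


def collect_hdf_files(folder, rec=True, pattern='*.hdf'):
    """Return sorted HDF file paths in `folder`, optionally including subfolders."""
    folder = pathlib.Path(folder)
    candidates = folder.rglob(pattern) if rec else folder.glob(pattern)
    return sorted(p for p in candidates if p.is_file())


recursive = [p.name for p in collect_hdf_files(root, rec=True)]
top_level = [p.name for p in collect_hdf_files(root, rec=False)]
print(recursive)
print(top_level)
```

With `rec=True` the file in `sub/` is included, while `rec=False` only returns files located directly in the given folder.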

## Examples of queries:

In [15]:
h5tbx.database.File(filename).find({'$name': '/velocity'}, '$dataset')

[<LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]

In [16]:
h5tbx.database.File(filename).find({'$basename': {'$regex': 'group'}}, '$group', rec=False)

[<LGroup "/group2" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LGroup "/group1" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">]

In [17]:
h5tbx.database.File(filename).find({'$shape': (5,)}, '$dataset')

[<LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]

In [18]:
datasets = h5tbx.database.File(filename).find({'$ndim': 1}, '$dataset')

In [19]:
from h5rdmtoolbox.database import file

In [20]:
sorted(datasets)

[<LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]

In [21]:
sorted(datasets)[0].find

<bound method LGroup.find of <LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>>

In [22]:
with sorted(datasets)[0] as ds:
    r = file.find(ds, {'$name': {'$regex': 'group[0-9]'}}, '$dataset', False, True, False)
    print(r)

<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">


### Query from an opened file:

In [23]:
from pprint import pprint

with h5tbx.File(filename) as h5:
    print('find basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset'))
    
    print('\nfind name=/velocity in root:')
    pprint(h5.find({'$name': '/velocity'}, '$dataset'))
    
    print('\nfind name=/sub/velocity in root:')
    pprint(h5.find({'$name': '/sub/velocity'}, '$dataset'))
    
    print('\nfind basename=velocity in group1/:')
    pprint(h5['group1'].find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=True))

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">]

find name=/velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">]

find name=/sub/velocity in root:
[]

find basename=velocity in group1/:
[<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">]

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<f8", convention "h5py">]
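The outputs above differ because `$name` compares the full path, `$basename` only its last component, and `rec` controls whether subgroups are visited. A minimal sketch of that logic on plain path strings, assuming a hypothetical helper `find_by_basename`:

```python
import posixpath

# Object paths as they exist in the tutorial file.
paths = ['/velocity', '/group1/velocity', '/group2/velocity']


def find_by_basename(paths, start, basename, rec=True):
    """Return paths below the group `start` whose last component equals `basename`."""
    hits = []
    for path in paths:
        parent = posixpath.dirname(path)
        if rec:
            # Recursive: the object may sit anywhere below the start group.
            inside = parent == start or parent.startswith(start.rstrip('/') + '/')
        else:
            # Non-recursive: the object must be a direct child of the start group.
            inside = parent == start
        if inside and posixpath.basename(path) == basename:
            hits.append(path)
    return hits


print(find_by_basename(paths, '/', 'velocity', rec=True))
print(find_by_basename(paths, '/', 'velocity', rec=False))
print(find_by_basename(paths, '/group1', 'velocity', rec=False))
```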


In [24]:
import pathlib

class FileDB:
    """User-friendly interface to database.Folder, database.File or database.Files"""

    def __new__(cls, path, rec=False):
        if isinstance(path, (list, tuple)):
            filenames = []
            for p in [pathlib.Path(_p) for _p in path]:
                if p.is_file():
                    filenames.append(p)
                elif p.is_dir():
                    if rec:
                        for f in p.rglob('*.hdf'):
                            filenames.append(f)
                    else:
                        for f in p.glob('*.hdf'):
                            filenames.append(f)
            return h5tbx.database.Files(filenames)
            
        path = pathlib.Path(path)
        if path.is_dir():
            return h5tbx.database.Folder(path, rec=rec)
        return h5tbx.database.File(path)

In [25]:
FileDB([filename, ]).find({})

[<LGroup "/group1" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LGroup "/" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LGroup "/group2" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">]

In [26]:
for f in FileDB([filename, ]).values():
    print(f)

In [27]:
FileDB(filename).find({})

[<LGroup "/" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LGroup "/group2" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LGroup "/group1" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]

In [28]:
FileDB(filename.parent).find({})

[<LGroup "/" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LGroup "/group2" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/group1/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LDataset "/group2/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>,
 <LGroup "/group1" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf">,
 <LDataset "/velocity" in "C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp_0\tmp0.hdf" attrs=standard_name=x_velocity, units=m/s>]