# Serverless HDF Database

Think of the following two scenarios:

 - (a) We have a single file with many groups and attributes, and we are looking for a specific dataset based on one of its attributes.
 - (b) We have written multiple HDF5 files, and we need to select the files that contain certain metadata.

In both cases, we don't need to fill a database first; we can just **use the files as a database**.

In case (a), the code searches all groups and checks each object against the query. Case (b) works the same way, except that the search is applied to multiple files. The approach is brute force, but in most cases it is likely still fast enough. Note that filling a database up front would also have taken some time. Also note that HDF attributes are quick to read, because the raw data stays untouched even when the files are opened.
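The brute-force idea can be illustrated without any HDF5 library. The following is a minimal sketch, not the toolbox's implementation: a nested dictionary stands in for the file, and the helper names `walk` and `find_by_attrs` are hypothetical.

```python
# Toy stand-in for an HDF5 file: every object has "attrs"; groups also have
# "children". This is NOT the h5rdmtoolbox data model, just an illustration.
tree = {
    "attrs": {"project": "tutorial"},
    "children": {
        "velocity": {"attrs": {"units": "m/s", "standard_name": "x_velocity"}},
        "sub": {
            "attrs": {},
            "children": {
                "velocity": {"attrs": {"units": "m/s", "standard_name": "x_velocity"}},
                "pressure": {"attrs": {"units": "Pa", "standard_name": "pressure"}},
            },
        },
    },
}


def walk(node, path="/"):
    """Yield (path, node) for the node itself and all of its descendants."""
    yield path, node
    for name, child in node.get("children", {}).items():
        yield from walk(child, path.rstrip("/") + "/" + name)


def find_by_attrs(root, wanted):
    """Return the paths of all objects whose attributes contain every pair in `wanted`."""
    return [
        path
        for path, node in walk(root)
        if all(node["attrs"].get(key) == value for key, value in wanted.items())
    ]


matches = find_by_attrs(tree, {"standard_name": "x_velocity"})
print(matches)
```

Every object is visited once and only its attributes are compared, which mirrors why such a search never needs to touch the raw data.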

The queries closely follow the `mongodb` query syntax.
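To give a feeling for what such a query means, here is a small sketch of how a query dictionary could be evaluated against one object's metadata. It is an illustration only: the function `satisfies` and the `OPERATORS` table are hypothetical, and of the operators shown only `$regex` appears in this tutorial; `$eq`, `$gt` and `$lt` are borrowed from MongoDB and are assumptions as far as the toolbox is concerned.

```python
import re

# Operators handled by this sketch (see the note above on which are assumptions).
OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    "$gt": lambda value, arg: value is not None and value > arg,
    "$lt": lambda value, arg: value is not None and value < arg,
    "$regex": lambda value, arg: isinstance(value, str) and re.search(arg, value) is not None,
}


def satisfies(record, query):
    """Return True if `record` (a flat dict) meets every condition in `query`."""
    for key, condition in query.items():
        value = record.get(key)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$regex": "vel"}: all operators must hold.
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"units": "m/s"}: exact match required.
            return False
    return True


# Hypothetical flattened metadata of one dataset.
record = {"$basename": "velocity", "$ndim": 1, "units": "m/s"}

print(satisfies(record, {"$basename": {"$regex": "vel"}}))
print(satisfies(record, {"$ndim": {"$gt": 1}}))
```

A plain value asks for equality, while a nested dictionary applies one or more operators to the stored value.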

In [1]:
import h5rdmtoolbox as h5tbx

h5tbx.use(None)

Let's build a file first:

In [2]:
with h5tbx.File() as h5:
    h5.attrs['project'] = 'tutorial'
    h5.create_dataset('velocity', data=[1,2,-1], attrs=dict(units='m/s', standard_name='x_velocity'))
    g = h5.create_group('sub')
    g.create_dataset('velocity', data=[4,0,-3,12,3], attrs=dict(units='m/s', standard_name='x_velocity'))
    h5.dump()
    filename = h5.hdf_filename

## Find within a file

We can filter for attributes and properties. The find functions return so-called "lazy HDF5 objects". They behave like `Dataset` or `Group` objects but are "offline", so we can access all properties and attributes without opening the file. Only when a dataset is sliced is the file opened and closed in the background. This may be a bit slower in some cases but is much more convenient to use.
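The lazy-object pattern itself can be sketched in a few lines. This is an assumption-laden illustration, not the toolbox's `LDataset`: the class `LazyRecord` is hypothetical, and a JSON file stands in for the HDF5 file so that the example runs with the standard library only.

```python
import json
import tempfile
from pathlib import Path


class LazyRecord:
    """Keeps metadata in memory; touches the file only when data is sliced."""

    open_count = 0  # counts file accesses, for demonstration only

    def __init__(self, path, name, attrs):
        self.path = Path(path)
        self.name = name
        self.attrs = dict(attrs)  # cached copy, readable without file access

    def __getitem__(self, key):
        # Open, read, and close the file for this single access.
        with open(self.path) as f:
            LazyRecord.open_count += 1
            data = json.load(f)[self.name]
        return data[key]


# A JSON file plays the role of the HDF5 file in this sketch.
path = Path(tempfile.mkdtemp()) / "data.json"
path.write_text(json.dumps({"velocity": [4, 0, -3, 12, 3]}))

lazy = LazyRecord(path, "velocity", {"units": "m/s"})
units = lazy.attrs["units"]            # served from the cache, no file access
opens_before = LazyRecord.open_count
values = lazy[:]                       # the file is opened and closed here
opens_after = LazyRecord.open_count
print(units, values, opens_before, opens_after)
```

The trade-off is the one described above: each slice pays for opening the file again, but no file handle has to be managed by the caller.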


To find within a single file, call `h5tbx.FileDB`:

In [3]:
results = h5tbx.FileDB(filename).find({'$basename': 'velocity'}, '$dataset')
results

[<h5rdmtoolbox.database.lazy.LDataset at 0x17e317828e0>,
 <h5rdmtoolbox.database.lazy.LDataset at 0x17e317829d0>]

The result objects are "lazy objects". We can access all attributes and properties. Only when slicing is the file actually opened:

In [4]:
results[0].attrs

{'standard_name': 'x_velocity', 'units': 'm/s'}

In [5]:
results[0].properties

{'ndim': 1,
 'shape': (5,),
 'dtype': dtype('int32'),
 'size': 5,
 'chunks': (5,),
 'compression': 'gzip',
 'compression_opts': 5,
 'shuffle': False,
 'fletcher32': False,
 'maxshape': (5,),
 'fillvalue': 0,
 'scaleoffset': None,
 'external': None}

In [6]:
results[0][:]

## Find within one or multiple folders
To search within a folder, use `h5tbx.FolderDB`. Pass `rec=False` to disable the recursive search for files in subfolders:

In [14]:
h5tbx.FolderDB(filename.parent, rec=True).find_one({'$basename': 'velocity'}, '$dataset').name

'/velocity'
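The difference between a recursive and a non-recursive file search can be shown with `pathlib` alone. This is only a sketch of the concept: how `FolderDB` collects its files internally is not shown in this tutorial, and both the helper `list_hdf_files` and the `*.hdf` pattern are assumptions.

```python
import tempfile
from pathlib import Path


def list_hdf_files(folder, rec=True, pattern="*.hdf"):
    """Return sorted file paths matching `pattern`, optionally including subfolders."""
    folder = Path(folder)
    found = folder.rglob(pattern) if rec else folder.glob(pattern)
    return sorted(p for p in found if p.is_file())


# Build a small folder tree with empty placeholder files.
root = Path(tempfile.mkdtemp())
(root / "nested").mkdir()
(root / "a.hdf").touch()
(root / "nested" / "b.hdf").touch()

top_only = [p.name for p in list_hdf_files(root, rec=False)]
everything = [p.name for p in list_hdf_files(root, rec=True)]
print(top_only, everything)
```

With `rec=False` only files directly inside the folder are considered; with `rec=True` files in nested folders are included as well.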

## Examples of queries:

In [16]:
h5tbx.FileDB(filename).find({'$name': '/velocity'}, '$dataset')

[<h5rdmtoolbox.database.lazy.LDataset at 0x17e201eeb80>]

In [17]:
h5tbx.FileDB(filename).find({'$basename': {'$regex': 'sub'}}, '$group', rec=False)

[<h5rdmtoolbox.database.lazy.LGroup at 0x17e201e0910>]

In [18]:
h5tbx.FileDB(filename).find({'$shape': (5,)}, '$dataset')

[<h5rdmtoolbox.database.lazy.LDataset at 0x17e201f9d00>]

In [19]:
h5tbx.FileDB(filename).find({'$ndim': 1}, '$dataset')

[<h5rdmtoolbox.database.lazy.LDataset at 0x17e317c75e0>,
 <h5rdmtoolbox.database.lazy.LDataset at 0x17e201ee3a0>]

### Query from an opened file:

In [20]:
from pprint import pprint

with h5tbx.File(filename) as h5:
    print('find basename=velocity in root:')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset'))
    
    print('\nfind name=/velocity in root:')
    pprint(h5.find({'$name': '/velocity'}, '$dataset'))
    
    print('\nfind name=/sub/velocity in root:')
    pprint(h5.find({'$name': '/sub/velocity'}, '$dataset'))
    
    print('\nfind basename=velocity in sub/:')
    pprint(h5['sub'].find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root (rec=False):')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=False))
    
    print('\nfind basename=velocity in root (rec=True):')
    pprint(h5.find({'$basename': 'velocity'}, '$dataset', rec=True))

find basename=velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]

find name=/velocity in root:
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">]

find name=/sub/velocity in root:
[<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]

find basename=velocity in sub/:
[<HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]

find basename=velocity in root (rec=False):
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">]

find basename=velocity in root (rec=True):
[<HDF5 dataset "velocity": shape (3,), type "<i4", convention "h5py">,
 <HDF5 dataset "velocity": shape (5,), type "<i4", convention "h5py">]
