# HDF5 + mongoDB

This database module uses MongoDB, accessed through `pymongo`, as its backend. Only metadata (or part of it) is stored in the database, not the raw data. How the raw data of a dataset is linked to the database is explained in the respective chapter of this document.

In [1]:
import pymongo
from pymongo import MongoClient

from h5rdmtoolbox import tutorial
import h5rdmtoolbox as h5tbx

import numpy as np

from pprint import pprint

h5tbx.__version__

tmp: C:\Users\da4323\AppData\Local\h5rdmtoolbox\h5rdmtoolbox\tmp\tmp1


'0.1.12'

## First things first: connecting to the DB
Connect to the running `mongod` instance:

In [2]:
client = MongoClient()
client

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

Create a database and a (test) collection named "digits":

In [12]:
db = client['h5database_notebook_tutorial']
collection = db['digits']

# drop all content in order to start from scratch:
collection.drop()

## Test data
We will take test data from scikit-learn, namely the hand-written digits (https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html):

In [13]:
# !pip install scikit-learn
# !pip install scikit-image

In [14]:
from sklearn.datasets import load_digits
digits = load_digits()

Fill an HDF5 file with the loaded data. In addition, we compute the mean count of each image and two properties of the gray-level co-occurrence matrix (dissimilarity and correlation). These three datasets, together with the true digit of each image, are linked to the image dataset via HDF5 dimension scales:

In [15]:
from skimage.feature import graycomatrix, graycoprops

filename = h5tbx.generate_temporary_filename()

with h5tbx.H5File(filename, 'w') as h5:
    ds_trg = h5.create_dataset('digit', data=digits.target, units='', long_name='true digit', make_scale=True)
    ds_img = h5.create_dataset('images', shape=(len(digits.images), 8, 8), units='pixel', long_name='image of hand-written digit')
    
    ds_mean = h5.create_dataset('mean', shape=(len(digits.images), ), units='counts', long_name='mean counts of the image', make_scale=True)
    ds_diss = h5.create_dataset('dissimilarity', shape=(len(digits.images), ), units='counts', long_name='dissimilarity', make_scale=True)
    ds_corr = h5.create_dataset('correlation', shape=(len(digits.images), ), units='counts', long_name='correlation', make_scale=True)
    
    
    for i, img in enumerate(digits.images):
        ds_img[i, :, :] = img
        ds_mean[i] = np.mean(img)
        
        glcm = graycomatrix(img.astype(int), distances=[5], angles=[0], levels=256,
                            symmetric=True, normed=True)
        ds_diss[i] = graycoprops(glcm, 'dissimilarity')[0, 0]
        ds_corr[i] = graycoprops(glcm, 'correlation')[0, 0]
        
    ds_img.dims[0].attach_scale(ds_trg)
    ds_img.dims[0].attach_scale(ds_mean)
    ds_img.dims[0].attach_scale(ds_diss)
    ds_img.dims[0].attach_scale(ds_corr)
    h5.dump()
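
The two texture properties stored above are derived from a gray-level co-occurrence matrix (GLCM): `P[i][j]` is the normalized frequency with which gray levels `i` and `j` occur at a given pixel offset. Dissimilarity is the sum of `P[i][j] * |i - j|`, and correlation is the covariance of the two gray levels divided by the product of their standard deviations. The following is a minimal pure-Python sketch for horizontal pixel pairs only, meant as an illustration and not as a replacement for `graycomatrix`/`graycoprops`; the helper names are hypothetical.

```python
from math import sqrt


def glcm_horizontal(image, levels, distance=1):
    """Symmetric, normalized co-occurrence matrix for pixel pairs along a row."""
    counts = [[0.0] * levels for _ in range(levels)]
    for row in image:
        for x in range(len(row) - distance):
            a, b = row[x], row[x + distance]
            counts[a][b] += 1
            counts[b][a] += 1  # symmetric: count every pair in both directions
    total = sum(sum(r) for r in counts)
    return [[c / total for c in r] for r in counts]


def dissimilarity(p):
    """Sum of P[i][j] * |i - j|."""
    n = len(p)
    return sum(p[i][j] * abs(i - j) for i in range(n) for j in range(n))


def correlation(p):
    """Covariance of the two gray levels, normalized by their standard deviations."""
    n = len(p)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    mu_i = sum(i * p[i][j] for i, j in pairs)
    mu_j = sum(j * p[i][j] for i, j in pairs)
    var_i = sum(p[i][j] * (i - mu_i) ** 2 for i, j in pairs)
    var_j = sum(p[i][j] * (j - mu_j) ** 2 for i, j in pairs)
    if var_i == 0 or var_j == 0:
        return 1.0  # degenerate case (constant image); value chosen by convention in this sketch
    cov = sum(p[i][j] * (i - mu_i) * (j - mu_j) for i, j in pairs)
    return cov / sqrt(var_i * var_j)


# A tiny two-level checkerboard: horizontal neighbours always differ by one level.
image = [[0, 1, 0, 1],
         [1, 0, 1, 0]]
p = glcm_horizontal(image, levels=2)
print(dissimilarity(p))  # 1.0
print(correlation(p))    # -1.0
```

Neighbouring pixels in the checkerboard always differ by exactly one gray level, hence a dissimilarity of 1 and a perfect negative correlation.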

## Filling the database

To insert data from the HDF5 file into the DB, we need the accessor "mongo":

In [16]:
from h5rdmtoolbox.h5database import mongo

In [20]:
with h5tbx.H5File(filename) as h5:
    h5['images'].mongo.insert(0, collection, update=True)

Count the number of documents inserted:

In [21]:
collection.count_documents({})

3594
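
Judging from the query result shown further below, inserting along axis 0 appears to create one document per index of that axis, each carrying the values of the attached dimension scales plus a `slice` entry that points back into the array. The following sketch only mimics that layout to make it tangible. It is an assumption derived from the printed result and not the toolbox's implementation; `build_documents` is a hypothetical helper.

```python
def build_documents(dataset_name, shape, scales, axis=0):
    """Build one metadata document per index along `axis` (illustrative only)."""
    docs = []
    for i in range(shape[axis]):
        # full range on every axis, except a single index on the chosen axis
        slc = [[0, None, 1] for _ in shape]
        slc[axis] = [i, i + 1, 1]
        doc = {'name': dataset_name,
               'shape': list(shape),
               'ndim': len(shape),
               'slice': slc}
        # one scalar per attached dimension scale
        for scale_name, values in scales.items():
            doc[scale_name] = values[i]
        docs.append(doc)
    return docs


docs = build_documents('/images', (3, 8, 8),
                       {'digit': [0, 1, 2], 'mean': [4.9, 3.1, 4.2]})
print(len(docs))
print(docs[1])
```

Only scalars and the slice are stored per document; the pixel values themselves stay in the HDF5 file.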

## Find one:

In [19]:
one_res = collection.find_one({'digit': {'$eq': 3}})
one_res

{'_id': ObjectId('631efe87037a49b2d4a8673f'),
 'filename': 'C:\\Users\\da4323\\AppData\\Local\\h5rdmtoolbox\\h5rdmtoolbox\\tmp\\tmp1\\tmp3',
 'name': '/images',
 'basename': 'images',
 'file_creation_time': datetime.datetime(2022, 9, 12, 9, 40, 19, 703000),
 'shape': [1797, 8, 8],
 'ndim': 3,
 'hdfobj': 'dataset',
 'slice': [[3, 4, 1], [0, None, 1], [0, None, 1]],
 'digit': 3,
 'mean': 4.171875,
 'dissimilarity': 4.875,
 'correlation': -0.3547935485839844,
 'long_name': 'image of hand-written digit',
 'units': 'pixel'}

We found only one entry because we asked for only one. Note that the result dictionary provides a "slice" entry. This is the slice within the 3D array in the HDF5 file that belongs to the entry, stored as `[start, stop, step]` per axis. We can use it to slice the array. There is even a method in the accessor to simplify this:

In [None]:
with h5tbx.H5File(filename) as h5:
    h5.images.mongo.slice(one_res['slice']).plot(cmap='gray')
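
For illustration, the stored `slice` entry can also be converted by hand into Python `slice` objects. The sketch below uses nested lists as a stand-in for the 3D dataset so that it runs without any HDF5 file; `to_slices` and `apply_slices` are hypothetical helpers and not part of the toolbox.

```python
def to_slices(stored):
    """Turn [[start, stop, step], ...] as stored in the database into slice objects."""
    return tuple(slice(start, stop, step) for start, stop, step in stored)


def apply_slices(nested, slices):
    """Apply one slice per axis to a nested list (stand-in for an array)."""
    if not slices:
        return nested
    head, *rest = slices
    return [apply_slices(item, rest) for item in nested[head]]


# Stand-in for the 'images' dataset: 5 "images" of 2x2 pixels,
# where every pixel of image k has the value k.
images = [[[k, k], [k, k]] for k in range(5)]

stored = [[3, 4, 1], [0, None, 1], [0, None, 1]]  # same form as in the query result
slices = to_slices(stored)
selected = apply_slices(images, slices)
print(slices)
print(selected)  # [[[3, 3], [3, 3]]]
```

With a NumPy array or an h5py dataset the tuple can be used directly as an index, e.g. `arr[to_slices(stored)]`.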

## Find many:

Let's query a range of data. The mean count shall be above 5 counts and the digit shall be >3 and <=8:

In [None]:
# Both bounds on 'digit' go into one operator document.
# Repeating the 'digit' key in the dict would silently discard the first bound.
query = {'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}}
collection.count_documents(query)
many_res = collection.find(query)
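
One pitfall when writing range queries as Python dicts: a key that is repeated within a dict literal keeps only its last value, so both bounds on a field have to go into a single operator document. The sketch below illustrates this without a database; `matches` is a hypothetical stand-in for the server-side evaluation of `$gt` and `$lte` and covers only these two operators.

```python
import operator

# Minimal stand-in for two MongoDB comparison operators.
OPS = {'$gt': operator.gt, '$lte': operator.le}


def matches(doc, query):
    """Return True if `doc` fulfils every condition in `query`."""
    return all(OPS[op](doc[field], ref)
               for field, conditions in query.items()
               for op, ref in conditions.items())


# The second 'digit' entry overwrites the first one: only '$lte' survives.
repeated = {'mean': {'$gt': 5}, 'digit': {'$gt': 3}, 'digit': {'$lte': 8}}
# Both bounds in one operator document: 3 < digit <= 8.
combined = {'mean': {'$gt': 5}, 'digit': {'$gt': 3, '$lte': 8}}

docs = [{'mean': 6.0, 'digit': d} for d in range(10)]

print(repeated['digit'])                                     # {'$lte': 8}
print([d['digit'] for d in docs if matches(d, repeated)])    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print([d['digit'] for d in docs if matches(d, combined)])    # [4, 5, 6, 7, 8]
```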

Inspect the result with the help of pandas:

In [None]:
import pandas as pd

In [None]:
# rewind the cursor so that all results are read, even if it was iterated before
df = pd.DataFrame(list(many_res.rewind()))
# remove the columns that are not needed for the inspection
df = df.drop(columns=['_id', 'filename', 'slice', 'long_name'])

pd.plotting.scatter_matrix(df[['mean', 'dissimilarity', 'correlation', 'digit']], hist_kwds={'bins': 20})
df.head()

## Query for other metadata

So far we have inserted a single dataset only. We could also have inserted the content of a single group or the full file, i.e. a recursive run that inserts all objects in the file. Let's do that:

In [None]:
db = client['h5database_notebook_tutorial']
collection_full_digits = db['full_digits']

# drop all content in order to start from scratch:
collection_full_digits.drop()

In [None]:
with h5tbx.H5File(filename) as h5:
    h5.mongo.insert(collection_full_digits, recursive=True)

The first entry looks like this:

In [None]:
collection_full_digits.find_one({})

It is the entry of the root group and shows all attributes of that group.