Implementation of Container and mixed loaders (H4EP001)
With hickle 4.0.0 the code for dumping and loading dedicated objects
like scalar values or numpy arrays was moved to dedicated loader
modules. This first step of disentangling the hickle core machinery
from object specific code covered all objects and structures which
were mappable to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal
H4EP001 (telegraphic#135). This proposal specifies the extension of
the loader concept introduced by hickle 4.0.0 towards generic
PyContainer based and mixed loaders.

In addition to the proposed extension this implementation includes
the following extensions to hickle 4.0.0 and H4EP001:

H4EP001:
========
    The PyContainer interface includes a filter method which allows
    loaders, when data is loaded, to adjust, suppress, or insert
    additional data subitems of h5py.Group objects. In order to
    accomplish the temporary modification of h5py.Group and h5py.Dataset
    objects when the file is opened in read only mode, the
    H5NodeFilterProxy class is provided. This class stores all temporary
    modifications while the original h5py.Group and h5py.Dataset
    objects stay unchanged.
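
    A minimal sketch of how a loader side container may combine both
    hooks (class and attribute names are illustrative only, not part of
    the proposal):

        class FilteringContainer(PyContainer):
            def filter(self, items):
                # items are assumed to be the (name, node) pairs of the
                # loaded h5py.Group; wrapping a node in H5NodeFilterProxy
                # allows its attributes to be adjusted even when the file
                # is opened in read only mode
                for name, node in items:
                    if name == 'legacy_item':
                        proxy = H5NodeFilterProxy(node)
                        proxy.attrs['base_type'] = b'pickle'
                        yield name, proxy
                    else:
                        yield name, node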

hickle 4.0.0 / 4.0.1:
=====================
    Strings and arrays of bytes are stored as Python bytearrays and not
    as variable sized strings and bytes. The benefit is that hdf5 filters
    and hdf5 compression filters can be applied to Python bytearrays.
    The downside is that the data is stored as bytes of int8 datatype.
    This change affects native Python string scalars as well as numpy
    arrays containing strings.
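
    For example, hdf5 compression can now be requested for string data
    when dumping (a usage sketch; the file name is arbitrary):

        import hickle
        hickle.dump("some long text " * 1000, 'data.hkl',
                    compression='gzip', compression_opts=6)
        text = hickle.load('data.hkl')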

    Extends the pickle loader create_pickled_dataset function to support
    the Python copy protocol as proposed by issue
    telegraphic#125
    For this a dedicated PickledContainer is implemented to handle
    all objects which have been stored using the Python copy protocol.
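
    A sketch of an object which would be handled by the PickledContainer
    through the copy protocol (the class is illustrative only):

        import hickle

        class Point:
            def __init__(self, x=0, y=0):
                self.x, self.y = x, y

            def __getstate__(self):
                return {'x': self.x, 'y': self.y}

            def __setstate__(self, state):
                self.__dict__.update(state)

        hickle.dump(Point(1, 2), 'point.hkl')
        restored = hickle.load('point.hkl')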

    numpy masked arrays are now stored as an h5py.Group containing a
    dedicated dataset for the data and the mask each.
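
    For example (illustrative):

        import hickle
        import numpy as np

        arr = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
        # stored as a group holding one dataset for the data and one
        # for the mask
        hickle.dump(arr, 'masked.hkl')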

    scipy.sparse matrices are now stored as an h5py.Group containing
    the datasets data, indices, indptr and shape.
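
    For example (illustrative):

        import hickle
        from scipy.sparse import csr_matrix

        m = csr_matrix([[0, 0, 1], [2, 0, 0]])
        # stored as a group holding the data, indices, indptr and shape
        # datasets
        hickle.dump(m, 'sparse.hkl')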

    dictionary keys are now used as names for h5py.Dataset and
    h5py.Group objects.

    Only string, bytes, int, float, complex, bool and NoneType keys are
    converted to name strings; for all other keys a key-value-pair group
    is created containing the key and value as its subitems.
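
    For example (illustrative), when dumping

        import hickle

        data = {
            'simple': 1,    # string key, used directly as node name
            12: 2,          # int key, converted to a name string
            (1, 2): 3,      # tuple key, stored as key-value-pair group
        }
        hickle.dump(data, 'dict.hkl')

    the first two keys become names of h5py nodes while the tuple key is
    stored together with its value as subitems of a dedicated
    key-value-pair group.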

    string and bytes keys which contain slashes are stored as key
    value pairs instead of converting slashes to backslashes. The
    distinction from hickle 4.0.0 string and bytes keys with converted
    slashes is made by enclosing the string value in double quotes
    instead of the single quotes produced by the Python repr function
    and the !r or %r string format specifiers. Consequently, on load
    all string keys which are enclosed in single quotes are subjected
    to slash conversion while any others are used as is.

    h5py.Group and h5py.Dataset objects whose 'base_type' refers to
    'pickle' automatically get assigned object as their py_object_type
    on load. The related 'type' attribute is ignored. h5py.Group and
    h5py.Dataset objects which do not expose a 'base_type' attribute
    are assumed to either contain a pickle string or conform to the
    copy protocol and thus implicitly get assigned the 'pickle' base
    type. Consequently, on dump the 'base_type' and 'type' attributes
    are omitted for all h5py.Group and h5py.Dataset objects which
    contain pickle strings or conform to the Python copy protocol, as
    their values would be 'pickle' and object respectively.

Other stuff:
============
    Full separation between hickle core and loaders

    Distinct unit tests for individual loaders and hickle core

    Cleanup of functions and classes which are no longer required

    Simplification of recursion on dump and load through a
    self-contained loader interface.

    Hickle is capable of loading hickle 4.0.0 and 4.0.1 files, which
    do not yet support the PyContainer concept beyond list, tuple,
    dict and set.
hernot committed Jul 29, 2020
1 parent 861e4ae commit a638518
Showing 24 changed files with 3,978 additions and 1,510 deletions.
259 changes: 167 additions & 92 deletions hickle/helpers.py
@@ -1,116 +1,191 @@
# %% IMPORTS
# Built-in imports
import re
import operator
import typing
import types
import collections
import numbers

# Package imports
import dill as pickle


# %% FUNCTION DEFINITIONS
def get_type(h_node):
""" Helper function to return the py_type for an HDF node """
base_type = h_node.attrs['base_type']
if base_type != b'pickle':
py_type = pickle.loads(h_node.attrs['type'])
else:
py_type = None
return py_type, base_type


def get_type_and_data(h_node):
""" Helper function to return the py_type and data block for an HDF node"""
py_type, base_type = get_type(h_node)
data = h_node[()]
return py_type, base_type, data
# %% EXCEPTION DEFINITIONS

nobody_is_my_name = ()

class NotHicklable(Exception):
    """
    object can not be mapped to a proper hickle HDF5 file structure and
    thus shall be converted to a pickle string before storing.
    """
    pass


def sort_keys(key_list):
    """ Take a list of strings and sort it by integer value within string

    Args:
        key_list (list): List of keys

    Returns:
        key_list_sorted (list): List of keys, sorted by integer
    """

    # Py3 h5py returns an irritating KeysView object
    # Py3 also complains about bytes and strings, convert all keys to bytes
    key_list2 = []
    for key in key_list:
        if isinstance(key, str):
            key = bytes(key, 'ascii')
        key_list2.append(key)
    key_list = key_list2

    # Check which keys contain a number
    numbered_keys = [re.search(br'\d+', key) for key in key_list]

    # Sort the keys on number if they have it, or normally if not
    if(len(key_list) and not numbered_keys.count(None)):
        return(sorted(key_list,
                      key=lambda x: int(re.search(br'\d+', x).group(0))))
    else:
        return(sorted(key_list))
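
# Usage sketch (illustrative only): a loader's create_dataset style function
# may raise NotHicklable to signal that the passed object can not be mapped
# to a proper HDF5 structure and shall be stored as a pickle string instead:
#
#     def create_my_dataset(py_obj, h_group, name, **kwargs):
#         if not suitable_for_hdf5(py_obj):   # hypothetical check
#             raise NotHicklable
#         ...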


def check_is_iterable(py_obj):
    """ Check whether a python object is a built-in iterable.

    Note: this treats unicode and string as NON ITERABLE

    Args:
        py_obj: python object to test

    Returns:
        iter_ok (bool): True if item is iterable, False if item is not
    """

    # Check if py_obj is an accepted iterable and return
    return(isinstance(py_obj, (tuple, list, set)))


# %% CLASS DEFINITIONS

class PyContainer():
    """
    Abstract base class for all PyContainer classes acting as proxy between
    an h5py.Group and the python object represented by the content of the
    h5py.Group. Any container type object as well as complex objects are
    represented in a tree like structure in the HDF5 file, which PyContainer
    objects ensure is properly mapped before being converted into the final
    object.

    Parameters:
    -----------
    h5_attrs (h5py.AttributeManager):
        attributes defined on the h5py.Group object represented by this
        PyContainer
    base_type (bytes):
        the basic type used for representation in the HDF5 file
    object_type:
        type of Python object to be restored. Depending on the container it
        may be used by PyContainer.convert to convert the loaded Python
        object into the final one.

    Attributes:
    -----------
    base_type (bytes):
        the basic type used for representation in the HDF5 file
    object_type:
        type of Python object to be restored. Depending on the container it
        may be used by PyContainer.convert to convert the loaded Python
        object into the final one.
    """

__slots__ = ("base_type", "object_type", "_h5_attrs", "_content","__dict__" )

def __init__(self,h5_attrs, base_type, object_type,_content = None):
"""
Parameters (protected):
-----------------------
_content (default: list):
container to be used to collect the Python objects representing
the sub items or the state of the final Python object. Shall only
be set by derived PyContainer classes and not be set by
"""
# the base type used to select this PyContainer
self.base_type = base_type
# class of python object represented by this PyContainer
self.object_type = object_type
# the h5_attrs structure of the h5_group to load the object_type from
# can be used by the append and convert methods to obtain more
# information about the container like object to be restored
self._h5_attrs = h5_attrs
# intermediate list, tuple, dict, etc. used to collect and store the sub items
# when calling the append method
self._content = _content if _content is not None else []

    def filter(self, items):
        """
        can be overloaded by derived PyContainer classes to adjust, suppress
        or insert sub items of the h5py.Group while the object is loaded.
        The default implementation passes all items through unmodified.
        """
        yield from items

    def append(self, name, item, h5_attrs):
        """
        adds the passed item (object) to the content of this container.

        Parameters:
        -----------
        name (string):
            the name of the h5py.Dataset or h5py.Group the sub item was
            loaded from
        item:
            the Python object of the sub item
        h5_attrs:
            attributes defined on the h5py.Group or h5py.Dataset object
            the sub item was loaded from.
        """
        self._content.append(item)

    def convert(self):
        """
        creates the final object and populates it with the items stored in
        the _content slot. Must be implemented by the derived container
        classes.

        Returns:
        --------
        py_obj: The final Python object loaded from file
        """
        raise NotImplementedError("convert method must be implemented")

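# A minimal sketch of a derived container (illustrative only; the actual
# loader containers defined by this commit may differ in detail):
#
#     class ListLikeContainer(PyContainer):
#
#         def convert(self):
#             # build the final object, e.g. a list or tuple, from the
#             # collected sub items
#             return self.object_type(self._content)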

class H5NodeFilterProxy():
    """
    Proxy class which allows temporary modification of the content of
    h5_node.attrs. The original attributes of the underlying h5_node are
    left unchanged.

    Parameters:
    -----------
    h5_node:
        node for which attributes shall be replaced by a temporary value
    """

    __slots__ = ('_h5_node', 'attrs', '__dict__')

    def __init__(self, h5_node):
        self._h5_node = h5_node
        self.attrs = collections.ChainMap({}, h5_node.attrs)

    def __getattribute__(self, name):
        # for attrs and the wrapped _h5_node return the local copy; any
        # other request is redirected to the wrapped _h5_node
        if name in {"attrs", "_h5_node"}:
            return super(H5NodeFilterProxy, self).__getattribute__(name)
        _h5_node = super(H5NodeFilterProxy, self).__getattribute__('_h5_node')
        return getattr(_h5_node, name)

    def __setattr__(self, name, value):
        # if the wrapped _h5_node or attrs shall be set, store the value on
        # the local attributes; otherwise pass on to the wrapped _h5_node
        if name in {'_h5_node', 'attrs'}:
            super(H5NodeFilterProxy, self).__setattr__(name, value)
            return
        _h5_node = super(H5NodeFilterProxy, self).__getattribute__('_h5_node')
        setattr(_h5_node, name, value)

    def __getitem__(self, *args, **kwargs):
        _h5_node = super(H5NodeFilterProxy, self).__getattribute__('_h5_node')
        return _h5_node.__getitem__(*args, **kwargs)

    # TODO as needed add more functions like __getitem__ to fully proxy
    # h5_node, or consider using metaclass __getattribute__ for handling
    # special methods


def check_is_hashable(py_obj):
    """ Check if a python object is hashable

    Note: this function is currently not used, but is useful for future
    development.

    Args:
        py_obj: python object to test

    Returns:
        hashable (bool): True if hashable, False if not
    """

    try:
        py_obj.__hash__()
        return True
    except TypeError:
        return False


def check_iterable_item_type(iter_obj):
    """ Check if all items within an iterable are the same type.

    Args:
        iter_obj: iterable object

    Returns:
        iter_type: type of item contained within the iterable. If
            the iterable has many types, a boolean False is returned instead.

    References:
    http://stackoverflow.com/questions/13252333
    """

    iseq = iter(iter_obj)

    try:
        first_type = type(next(iseq))
    except StopIteration:
        return False
    except Exception:  # pragma: no cover
        return False
    else:
        if all([type(x) is first_type for x in iseq]):
            return(first_type)
        else:
            return(False)


# %% FUNCTION DEFINITIONS

def not_dumpable(py_obj, h_group, name, **kwargs):  # pragma: nocover
    """
    create_dataset method attached to dummy py_objects used to mimic
    container groups created by older versions of hickle which lack the
    generic PyContainer mapping of h5py.Groups to the corresponding
    py_object

    Raises:
    -------
    RuntimeError:
        in any case, as this function shall never be called
    """

    raise RuntimeError("types defined by loaders not dumpable")


def no_compression(kwargs):
    """
    filter which temporarily removes any compression or data filter related
    arguments from the passed kwargs dict.
    """
    return {
        key: value
        for key, value in kwargs.items()
        if key not in {"compression", "shuffle", "compression_opts",
                       "chunks", "fletcher32", "scaleoffset"}
    }
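
# Usage sketch (illustrative only): given some h5py node `h5_node` from a
# file opened in read only mode, attribute overrides live in the proxy's
# ChainMap copy while the underlying file stays untouched:
#
#     proxy = H5NodeFilterProxy(h5_node)
#     proxy.attrs['base_type'] = b'pickle'   # shadows the stored attribute
#     data = proxy[()]                       # other access is delegated
#
# no_compression would typically be applied when forwarding dataset creation
# keywords for items which shall not be compressed:
#
#     h_group.create_dataset(name, data=data, **no_compression(kwargs))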
