Implementation of Container and mixed loaders (H4EP001)
With hickle 4.0.0 the code for dumping and loading dedicated objects
such as scalar values or numpy arrays was moved to dedicated loader
modules. This first step of disentangling the hickle core machinery
from object-specific code covered all objects and structures which were
mappable to h5py.Dataset objects.

This commit provides an implementation of hickle extension proposal
H4EP001 (telegraphic#135), which specifies the extension of the loader
concept introduced by hickle 4.0.0 towards generic PyContainer based
and mixed loaders.

In addition to the proposed extension, this implementation includes the
following changes beyond hickle 4.0.0 and H4EP001:

H4EP001:
========
    The PyContainer interface includes a filter method which allows
    loaders, when data is loaded, to adjust, suppress, or insert
    additional data subitems of h5py.Group objects. To accomplish the
    temporary modification of h5py.Group and h5py.Dataset objects while
    the file is opened in read-only mode, the H5NodeFilterProxy class is
    provided. This class stores all temporary modifications while the
    original h5py.Group and h5py.Dataset objects stay unchanged.
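
    A minimal sketch of how a loader might use this; the PyContainer
    method names, the import path of H5NodeFilterProxy, and the
    container internals below are assumptions for illustration, not the
    final interface:

        from hickle.helpers import PyContainer
        from hickle.lookup import H5NodeFilterProxy   # import path assumed

        class MyGroupContainer(PyContainer):

            def filter(self, h_parent):
                # drop a bookkeeping subitem and patch an attribute on
                # the fly; the change lives on the proxy only, the
                # underlying read-only h5py node stays untouched
                for name, item in h_parent.items():
                    if name == 'bookkeeping':
                        continue
                    proxy = H5NodeFilterProxy(item)
                    proxy.attrs['base_type'] = b'list'
                    yield name, proxy

            def append(self, name, item, attrs):
                # _content list assumed to be set up by the base class
                self._content.append(item)

            def convert(self):
                return self.object_type(self._content)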

hickle 4.0.0 / 4.0.1:
=====================
    Strings and arrays of bytes are stored as Python bytearrays and not
    as variable-sized strings and bytes. The benefit is that hdf5
    filters and hdf5 compression filters can be applied to Python
    bytearrays. The downside is that the data is stored as bytes of
    int8 datatype. This change affects native Python string scalars as
    well as numpy arrays containing strings.
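
    The effect can be illustrated with plain h5py (file and dataset
    names are illustrative only):

        import h5py
        import numpy as np

        with h5py.File('demo.hkl', 'w') as f:
            # utf-8 bytes stored as a fixed-size int8 array, so hdf5
            # compression filters such as gzip can be applied to them
            data = np.frombuffer('hello world'.encode('utf-8'), dtype='i1')
            f.create_dataset('mystring', data=data, compression='gzip')

        with h5py.File('demo.hkl', 'r') as f:
            restored = f['mystring'][()].tobytes().decode('utf-8')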

    numpy masked arrays are now stored as an h5py.Group containing a
    dedicated dataset each for data and mask.
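
    Schematically, the resulting layout corresponds to the following
    (group and dataset names follow the description above, the rest is
    illustrative):

        import h5py
        import numpy as np

        arr = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

        with h5py.File('demo.hkl', 'w') as f:
            group = f.create_group('mymaskedarray')
            group.create_dataset('data', data=arr.data)
            group.create_dataset('mask', data=arr.mask)

        with h5py.File('demo.hkl', 'r') as f:
            group = f['mymaskedarray']
            restored = np.ma.array(group['data'][()], mask=group['mask'][()])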

    scipy.sparse matrices are now stored as an h5py.Group containing
    the datasets data, indices, indptr and shape.
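
    For a csr matrix this layout can be pictured as follows (sketch,
    not the actual loader code):

        import h5py
        import numpy as np
        from scipy import sparse

        matrix = sparse.csr_matrix(np.eye(3))

        with h5py.File('demo.hkl', 'w') as f:
            group = f.create_group('mymatrix')
            group.create_dataset('data', data=matrix.data)
            group.create_dataset('indices', data=matrix.indices)
            group.create_dataset('indptr', data=matrix.indptr)
            group.create_dataset('shape', data=matrix.shape)

        with h5py.File('demo.hkl', 'r') as f:
            group = f['mymatrix']
            restored = sparse.csr_matrix(
                (group['data'][()], group['indices'][()], group['indptr'][()]),
                shape=group['shape'][()])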

    dictionary keys are now used as names for h5py.Dataset and
    h5py.Group objects.

    Only string, bytes, int, float, complex, bool and NoneType keys are
    converted to name strings; for all other keys a key-value-pair
    group is created containing the key and value as its subitems.
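
    A sketch of the naming rule (the helper name is hypothetical, not
    hickle's actual code):

        def key_to_name(key):
            # simple keys become the name of the h5py node; None signals
            # that a key-value-pair group has to be created instead
            if isinstance(key, (str, bytes)):
                if '/' in str(key):
                    return None          # slashes would break the hdf5 path
                return '"{}"'.format(key)   # double quotes, see next item
            if isinstance(key, (int, float, complex, bool, type(None))):
                return repr(key)
            return None

    For example, the key 42 yields the dataset name 42, while a tuple
    key such as (1, 2) is stored as a key-value-pair group.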

    String and bytes keys which contain slashes are converted into
    key-value pairs instead of converting slashes to backslashes. The
    distinction from hickle 4.0.0 string and bytes keys with converted
    slashes is made by enclosing the string value in double quotes
    instead of the single quotes produced by the Python repr function
    or the !r and %r string format specifiers. Consequently, on load
    all string keys which are enclosed in single quotes are subjected
    to slash conversion while any others are used as-is.
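
    The corresponding loading rule could look like this sketch (helper
    name hypothetical):

        def key_from_name(name):
            # single quotes mark hickle 4.0.x names whose slashes were
            # converted to backslashes on dump; double quotes mark
            # new-style names which are used as-is
            if name.startswith("'") and name.endswith("'"):
                return name[1:-1].replace('\\', '/')
            if name.startswith('"') and name.endswith('"'):
                return name[1:-1]
            return name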

    h5py.Group and h5py.Dataset objects whose 'base_type' attribute
    refers to 'pickle' automatically get object assigned as their
    py_obj_type on load. The related 'type' attribute is ignored.
    h5py.Dataset objects which do not expose a 'base_type' attribute
    are assumed to contain a pickle string and thus get implicitly
    assigned the 'pickle' base type. Consequently, on dump the
    'base_type' and 'type' attributes are omitted for all h5py.Dataset
    objects containing pickle strings, as their values would be
    'pickle' and object respectively.
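
    A sketch of the resulting load-time logic (the function name and
    the assumption that 'type' holds a pickle string are for
    illustration only):

        import pickle

        def resolve_types(h_node):
            base_type = h_node.attrs.get('base_type', b'pickle')
            if base_type == b'pickle':
                # any stored 'type' attribute is ignored in this case
                return b'pickle', object
            return base_type, pickle.loads(h_node.attrs['type'])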

Other stuff:
============
    Full separation between hickle core and loaders

    Distinct unit tests for individual loaders and hickle core

    Cleanup of functions and classes which are no longer required.

    Simplification of recursion on dump and load through a
    self-contained loader interface.
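
    The scheme can be pictured as follows (sketch only, the names do
    not match the actual implementation):

        types_dict = {}   # maps py_obj_type -> (create function, base_type)

        def _dump(py_obj, h_group, name, **kwargs):
            create_dataset, base_type = types_dict[type(py_obj)]
            # the create function returns the new node plus the subitems
            # still to be dumped below it, so the recursion stays inside
            # this single self-contained function
            h_node, subitems = create_dataset(py_obj, h_group, name, **kwargs)
            for subname, subitem, attrs, sub_kwargs in subitems:
                _dump(subitem, h_node, subname, **sub_kwargs)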

    Hickle is capable of loading hickle 4.0.x files, which do not yet
    support the PyContainer concept beyond list, tuple, dict and set;
    extended tests for loading hickle 4.0.x files are included.

    Contains a fix for the lambda py_obj_type issue on numpy arrays
    with single non-list/tuple object content: Python 3.8 refuses to
    unpickle the lambda function string. This was observed while
    finalizing the pull request. The fixes are only activated when a
    4.0.x file is to be loaded.

    Exceptions thrown by load now include the exception triggering
    them, including its stack trace, for better localization of errors
    in debugging and error reporting.
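
    In other words, the triggering exception is chained (sketch of the
    pattern, not the exact hickle code):

        try:
            py_obj = load(h_root_group)
        except Exception as error:
            # 'raise ... from ...' preserves the original stack trace
            raise RuntimeError("failed to load file") from error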

    h5py version limited to <3.x according to issue telegraphic#143
hernot committed Dec 3, 2020
1 parent b20b2d2 commit c681971
Showing 5 changed files with 77 additions and 61 deletions.
3 changes: 2 additions & 1 deletion hickle/loaders/load_pandas.py
@@ -1,4 +1,5 @@
import pandas as pd

# TODO: populate with classes to load
class_register = []
class_register = []
exclude_register = []
63 changes: 20 additions & 43 deletions hickle/lookup.py
@@ -1,31 +1,23 @@
"""
#lookup.py
<<<<<<< HEAD
This file manages all the mappings between hickle/HDF5 metadata and python
types.
There are three dictionaries that are populated here:
1) types_dict
Mapping between python types and dataset and group creation functions, e.g.
=======
This file contains all the mappings between hickle/HDF5 metadata and python types.
There are four dictionaries and one set that are populated here:
1) types_dict
types_dict: mapping between python types and dataset creation functions, e.g.
>>>>>>> Adding setup.py optional dependencies
types_dict = {
list: create_listlike_dataset,
int: create_python_dtype_dataset,
np.ndarray: create_np_array_dataset
list: (create_listlike_dataset, 'list'),
int: (create_python_dtype_dataset, 'int'),
np.ndarray: (create_np_array_dataset, 'ndarray'),
}
2) hkl_types_dict
hkl_types_dict: mapping between hickle metadata and dataset loading functions, e.g.
Mapping between hickle metadata and dataset loading functions, e.g.
hkl_types_dict = {
"<type 'list'>" : load_list_dataset,
"<type 'tuple'>" : load_tuple_dataset
'list': load_list_dataset,
'tuple': load_tuple_dataset
}
3) hkl_container_dict
@@ -36,33 +28,17 @@
'dict': DictLikeContainer
}
5) types_not_to_sort
type_not_to_sort is a list of hickle type attributes that may be hierarchical,
but don't require sorting by integer index.
## Extending hickle to add support for other classes and types
The process to add new load/dump capabilities is as follows:
1) Create a file called load_[newstuff].py in loaders/
2) In the load_[newstuff].py file, define your create_dataset and load_dataset functions,
along with all required mapping dictionaries.
3) Add an import call here, and populate the lookup dictionaries with update() calls:
# Add loaders for [newstuff]
try:
from .loaders.load_[newstuff] import types_dict as ns_types_dict
from .loaders.load_[newstuff] import hkl_types_dict as ns_hkl_types_dict
types_dict.update(ns_types_dict)
hkl_types_dict.update(ns_hkl_types_dict)
... (Add container_types_dict etc if required)
except ImportError:
raise
2) In the load_[newstuff].py file, define your create_dataset and load_dataset
functions, along with the 'class_register' and 'exclude_register' lists.
"""

import six
import pkg_resources

<<<<<<< HEAD
# %% IMPORTS
# Built-in imports
import sys
@@ -79,13 +55,11 @@

# hickle imports
from .helpers import PyContainer,not_dumpable,nobody_is_my_name
=======
>>>>>>> Adding setup.py optional dependencies

def return_first(x):
""" Return first element of a list """
return x[0]

# %% GLOBALS
# Define dict of all acceptable types
types_dict = {}

# Define dict of all acceptable hickle types
hkl_types_dict = {}
@@ -96,9 +70,6 @@ def return_first(x):
# Empty list (hashable) of loaded loader names
loaded_loaders = set()

if six.PY2:
container_key_types_dict[b"<type 'unicode'>"] = unicode
container_key_types_dict[b"<type 'long'>"] = long

# %% FUNCTION DEFINITIONS
def load_nothing(h_node,base_type,py_obj_type): # pragma: nocover
@@ -139,6 +110,7 @@ def register_class(myclass_type, hkl_str, dump_function=None, load_function=None
Parameters:
-----------
myclass_type type(class): type of class
hkl_str (str): String to write to HDF5 file to describe class
dump_function (function def): function to write data to HDF5
load_function (function def): function to load data from HDF5
container_class (class def): proxy class to load data from HDF5
@@ -195,10 +167,12 @@ def register_class(myclass_type, hkl_str, dump_function=None, load_function=None


def register_class_exclude(hkl_str_to_ignore):
""" Tell loading funciton to ignore any HDF5 dataset with attribute 'type=XYZ'
""" Tell loading funciton to ignore any HDF5 dataset with attribute
'type=XYZ'
Args:
hkl_str_to_ignore (str): attribute type=string to ignore and exclude from loading.
hkl_str_to_ignore (str): attribute type=string to ignore and exclude
from loading.
"""

if hkl_str_to_ignore in {b'dict_item',b'pickle'}:
@@ -235,6 +209,9 @@ def load_loader(py_obj_type, type_mro = type.mro):
-------
RuntimeError:
in case py object is defined by hickle core machinery.
"""

# any function or method object, any class object will be passed to pickle
# ensure that in any case create_pickled_dataset is called.

2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
dill>=0.3.0
h5py>=2.8.0
h5py>=2.8.0,<3
numpy>=1.8
six>=1.11.0
3 changes: 2 additions & 1 deletion requirements_test.txt
Expand Up @@ -5,4 +5,5 @@ astropy>=1.3,<4.0
scipy>=1.0.0
pandas>=0.24.0
check-manifest
twine>=1.13.0
twine>=1.13.0
h5py<3
67 changes: 52 additions & 15 deletions setup.py
@@ -1,31 +1,68 @@
# To increment version
# Check you have ~/.pypirc filled in
# git tag x.y.z
# git push --tags
# python setup.py sdist upload
# git push && git push --tags
# rm -rf dist; python setup.py sdist bdist_wheel
# TEST: twine upload --repository-url https://test.pypi.org/legacy/ dist/*
# twine upload dist/*

from codecs import open
import re

from setuptools import setup, find_packages
import sys

author = "Danny Price, Ellert van der Velden and contributors"

with open("README.md", "r") as fh:
long_description = fh.read()

with open("requirements.txt", 'r') as fh:
requirements = fh.read().splitlines()

with open("requirements_test.txt", 'r') as fh:
test_requirements = fh.read().splitlines()

# Read the __version__.py file
with open('hickle/__version__.py', 'r') as f:
vf = f.read()

version = '3.3.0'
author = 'Danny Price'
# Obtain version from read-in __version__.py file
version = re.search(r"^_*version_* = ['\"]([^'\"]*)['\"]", vf, re.M).group(1)

setup(name='hickle',
version=version,
description='Hickle - a HDF5 based version of pickle',
description='Hickle - an HDF5 based version of pickle',
long_description=long_description,
long_description_content_type='text/markdown',
author=author,
author_email='dan@thetelegraphic.com',
url='http://github.com/telegraphic/hickle',
download_url='https://github.com/telegraphic/hickle/archive/%s.tar.gz' % version,
download_url=('https://github.com/telegraphic/hickle/archive/v%s.zip'
% (version)),
platforms='Cross platform (Linux, Mac OSX, Windows)',
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
'License :: OSI Approved',
'Natural Language :: English',
'Operating System :: MacOS',
'Operating System :: Microsoft :: Windows',
'Operating System :: Unix',
'Programming Language :: Python',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Topic :: Software Development :: Libraries :: Python Modules',
'Topic :: Utilities',
],
keywords=['pickle', 'hdf5', 'data storage', 'data export'],
#py_modules = ['hickle', 'hickle_legacy'],
install_requires=['numpy', 'h5py'],
extras_require={
'astropy': ['astropy'],
'scipy': ['scipy'],
'pandas': ['pandas'],
'color': ['django']
},
python_requires='>=2.7',
install_requires=requirements,
tests_require=test_requirements,
python_requires='>=3.5',
packages=find_packages(),
zip_safe=False,
)
