
H5PL: Plugin Interface #1166

Status: Closed (12 commits)

Conversation

aparamon (Member):

Implementation of #928.

Please review. A MutableSequence high-level interface could in principle be implemented, but do we really need it for such niche functionality?
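For context, the MutableSequence wrapper mentioned above could be as small as the following sketch. The `backend` object here stands in for the low-level h5pl module; the five primitive calls it assumes (`size`/`get`/`insert`/`remove`/`replace`, mirroring the H5PL* C functions) and their exact signatures are assumptions, not the PR's actual API.

```python
from collections.abc import MutableSequence

class PluginSearchPath(MutableSequence):
    """Sequence view of the HDF5 plugin search path (sketch).

    Only five primitive backend calls are assumed: size(), get(i),
    insert(path, i), remove(i), replace(path, i).
    """

    def __init__(self, backend):
        self._b = backend

    def __len__(self):
        return self._b.size()

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._b.get(index)

    def __setitem__(self, index, value):
        self._b.replace(value, index)

    def __delitem__(self, index):
        self._b.remove(index)

    def insert(self, index, value):
        # MutableSequence derives append(), extend(), etc. from this
        self._b.insert(value, index)
```

Everything else (`append`, `extend`, iteration, `in`) comes for free from the `MutableSequence` mixins, which is the main appeal of this design.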

@codecov (bot) commented Jan 24, 2019

Codecov Report

Merging #1166 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #1166   +/-   ##
=======================================
  Coverage   83.73%   83.73%           
=======================================
  Files          18       18           
  Lines        2146     2146           
=======================================
  Hits         1797     1797           
  Misses        349      349
Impacted Files Coverage Δ
h5py/__init__.py 59.64% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6768237...885a286.

class TestSearchPaths(TestCase):

    def test_default(self):
        self.assertEqual(h5pl.size(), 1)
Member:

Does this depend on the HDF5_PLUGIN_PATH environment variable?

aparamon (Member Author):

No, quite the contrary: it assumes the default value (see next line).
Running in a separate process, without side effects and with a predictable HDF5_PLUGIN_PATH, sounds good. What's the most elegant way to do that?

Member:

I don't know if pytest has some neat magic to make this convenient, but I would do something like this:

import sys
from subprocess import run, PIPE, STDOUT

res = run([sys.executable, '-m', 'pytest', '...'], stdout=PIPE, stderr=STDOUT)
print(res.stdout)
assert res.returncode == 0

The ... there would contain whatever arguments are needed to make pytest run the tests in a module it would otherwise skip.

aparamon (Member Author):

Are you sure the current CI uses pytest, and not unittest2?

Member:

PR #1003 was merged, so as far as I know, it uses pytest.

A lot of tests are still written with the unittest API; pytest can find and run these with no problem. I'm not personally enthusiastic about 'converting' existing tests to a more idiomatic pytest style - it sounds like a lot of change for very little benefit - but I'd say it's OK to write new tests assuming pytest.

aparamon (Member Author):

What do you think about the approach of https://github.com/pytest-dev/pytest-forked?

Member:

Not enthusiastic! It looks like it's implemented as a command-line option which applies to all tests, whereas we want to apply this to only a specific set of tests that manipulate global state. Also, fork-without-exec is easy to mess up.

I can imagine this being a useful general utility, though - run some tests in a subprocess and integrate their results back into the main test results. Pytest itself has some machinery for testing pytest plugins which might be a useful starting point for something like this: https://docs.pytest.org/en/latest/writing_plugins.html#testing-plugins
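Not the pytest plugin-testing machinery itself, but the underlying idea — run isolated test code in a child interpreter and inspect the result in the parent — can be sketched with the stdlib alone (the helper name here is mine, not anything from this PR):

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def run_in_subprocess(source):
    """Run `source` as a standalone Python script in a child process.

    Global state the code mutates (such as a C library's process-wide
    plugin search path) dies with the child, so the parent test run is
    unaffected.  Returns the CompletedProcess for inspection.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "isolated_case.py"
        script.write_text(textwrap.dedent(source))
        return subprocess.run(
            [sys.executable, str(script)],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        )
```

A real utility would additionally need to report the child's failures back into the main test results, which is the part pytest's plugin-testing machinery could help with.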


from ..common import TestCase

@ut.skip('The tests have side effects')
Member:

If we go for this, I think we should come up with a way to run the tests by default; maybe run them in a separate process? Tests that don't run by default aren't very helpful.

aparamon (Member Author):

Tests are also documentation ;-)
But I agree; see my other comment.

# === C API ===================================================================

IF HDF5_VERSION >= (1, 10, 1):
    cpdef append(const char* search_path):
Member:

I'm not very familiar with the low-level API, but should these have the with_phil decorator for locking? Check with the other maintainers before adding this; it might not be necessary.

Member:

The phil lock is mostly about making sure that we don't create/reuse/destroy C-level objects in an unsafe way (this is to support correct ref-counting, holding C-level File objects open even if the user drops all (direct) Python references to the File and only keeps Datasets). See the code in _objects.pyx, which is rather racy (including an issue where we discovered we have to turn off gc temporarily due to list(of_a_dict) having "mutation during iteration" issues).

That is a very long way of saying I don't think we need to lock with the same lock as everything else (if the user has race conditions between adding plugins to the search path and opening files, I think that is on them).

On the other hand, these should be used rarely enough that locking couldn't hurt. We may also want to lock them against each other (I can imagine an append and a remove colliding and ending up with a "gap" in the HDF5 data structures...).

@aparamon (Member Author) commented Feb 4, 2019

Uhh, it took longer to make it run on all Pythons, but now at least Travis passes. I have no idea what's going on with AppVeyor though: https://ci.appveyor.com/project/h5py/h5py/builds/22108613/job/1aqbnhig7tw4hxr4#L237

@takluyver Could you please take another look and judge whether my sandboxing implementation is in principle tolerable?

h5pl.append(b'/opt/hdf5/vendor-plugin')
self.assertEqual(h5pl.size(), 2)
self.assertTrue(h5pl.get(0).endswith(b'hdf5/lib/plugin\x00'))
self.assertEqual(h5pl.get(1), b'/opt/hdf5/vendor-plugin\x00')
Member:

I would generally expect a Python API to return strings without the trailing null. Does h5py's low level API include the trailing null elsewhere?

aparamon (Member Author):

I'm not sure, and frankly I was quite surprised HDF5 returns nulls here. The docs do not mention it (unless I'm missing something).
Being in doubt, I went with the "h5py is a thin, pythonic wrapper" motto.

Member:

I'd guess it doesn't mention it because it's normal for C to store strings in n+1 bytes including the terminating null byte. Maybe Cython has some convenient functionality to trim off the null byte?
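On the Cython side the clean fix is simply to size the PyBytes object without the terminator; as a plain-Python illustration of the trimming (the helper name is mine):

```python
def strip_trailing_nul(raw):
    """Drop a single trailing NUL terminator from a bytes value, if present."""
    return raw[:-1] if raw.endswith(b'\x00') else raw
```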

@takluyver (Member)

For running the tests in a subprocess, I was thinking of something like using a pytest marker along with a pytest configuration file to exclude these tests by default. Then run pytest in a subprocess, overriding the marker selection to run only those tests.

I haven't 100% figured this out, but I'm pretty sure it should be possible, and it saves creating temporary files for the test code.
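Roughly, the configuration half of that could look like this (the `h5pl` marker name is a placeholder):

```ini
# pytest.ini -- register the marker and deselect those tests by default
[pytest]
markers =
    h5pl: mutates the global HDF5 plugin search path
addopts = -m "not h5pl"
```

A wrapper test would then invoke something like `python -m pytest -m h5pl` in a subprocess; a `-m` given on the command line should take precedence over the one injected via addopts, since addopts are prepended to the argument list.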

@takluyver (Member)

The problem with Appveyor appears to be affecting my build on #1132 as well, so it looks like it's not your PR causing it.

Someone gets to have fun debugging Windows testing issues! 😆 😞

@aparamon (Member Author) commented Feb 5, 2019

Well, each of the H5PL tests needs to run in its own subprocess. If we are less pedantic, they could run independently of the first batch but without isolation; failures would be a bit harder to localize, but pragmatically that might still be better than per-test sandboxing hackery.

However, I foresee great surprise for users when they discover that multiple pytest invocations are needed :-/ These tests are making me sad; maybe remove them altogether?

@aparamon (Member Author) commented Feb 5, 2019

Someone gets to have fun debugging Windows testing issues! 😆 😞

Is there a way to ssh/Remote Desktop there?

@takluyver (Member)

I'd say it's sufficient to run the whole group of them in a subprocess, rather than each one individually. I don't think it's too big a problem if they affect each other, but I'd rather they didn't cause cascading failures through the test suite.

I feel like we're missing something, though. This can't be the only project that wants to isolate a group of tests that work by changing global state.

Is there a way to ssh/Remote Desktop there?

Looks like there is if you have access to the appveyor control panel: https://www.appveyor.com/docs/how-to/rdp-to-build-worker/

@aparamon (Member Author) commented Feb 5, 2019

https://github.com/pytest-dev/pytest-forked is probably closest from what I googled so far.

    H5PLget(index, buf, n + 1)
    return PyBytes_FromStringAndSize(buf, n)
finally:
    efree(buf)
Member:

FWIW, here's another place in the codebase doing something similar:

h5py/h5py/h5p.pyx

Lines 849 to 856 in c585944

size = H5Pget_virtual_dsetname(self.id, index, NULL, 0)
name = <char*>emalloc(size+1)
try:
    # TODO check return size
    H5Pget_virtual_dsetname(self.id, index, name, <size_t>size+1)
    src_dset_name = bytes(name).decode('utf-8')
finally:
    efree(name)

I suspect that the hardcoded decode-as-utf8 there is not quite right, but it looks like one can call bytes(buf) to convert the buffer to Python bytes.

@takluyver (Member)

Yeah, pytest-forked is similar, but that's for when you want to isolate every test in your test suite, because any of them might segfault. And it doesn't work on Windows. I think there must be other projects which want to isolate just a few particular tests that are doing strange things.

@aparamon (Member Author) commented Feb 5, 2019

Well, I'm attending https://conf.python.ru/2019, so might be pointed to something worthy there ;-)
I agree that the task is quite generic, but it's just hard to implement (e.g. Windows lacking fork).

@takluyver (Member)

It's not easy to implement, but I don't think it's all that hard either; it needs a bit of thinking about how to get the right data to the right place. Maybe other projects just have a similar bit of ugly infrastructure built into their test suites. I don't know where to look for it, though. Perhaps it's more niche than I thought.

@aparamon (Member Author) commented Feb 6, 2019

Hit by morning inspiration, I went for something completely different. Why bother with subprocess isolation if HDF5 can be re-initialized afresh with an H5close+H5open combo?

class TestSearchPaths(ut.TestCase):

    def setUp(self):
        h5.close()
        # as per https://support.hdfgroup.org/HDF5/doc/Advanced/DynamicallyLoadedFilters/HDF5DynamicallyLoadedFilters.pdf,
        # in case your HDF5 setup has it different
        # (e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826522)
        os.environ['HDF5_PLUGIN_PATH'] = os.path.expandvars('%ALLUSERSPROFILE%/hdf5/lib/plugin') \
            if platform.system() == 'Windows' else '/usr/local/hdf5/lib/plugin'
        h5.open()

    def tearDown(self):
        h5.close()
        h5.open()

Exposing H5open/H5close was easy; unfortunately, it soon turned out that h5py does its own initialization on import. And importlib.reload(h5py) didn't prove enough:

h5py/h5py/h5t.pyx

Lines 211 to 247 in c585944

# Mini (or short) floats
IEEE_F16BE = IEEE_F32BE.copy()
IEEE_F16BE.set_fields(15, 10, 5, 0, 10)
IEEE_F16BE.set_size(2)
IEEE_F16BE.set_ebias(15)
IEEE_F16BE.lock()
IEEE_F16LE = IEEE_F16BE.copy()
IEEE_F16LE.set_order(H5T_ORDER_LE)
IEEE_F16LE.lock()
# Quad floats
IEEE_F128BE = IEEE_F64BE.copy()
IEEE_F128BE.set_size(16)
IEEE_F128BE.set_precision(128)
IEEE_F128BE.set_fields(127, 112, 15, 0, 112)
IEEE_F128BE.set_ebias(16383)
IEEE_F128BE.lock()
IEEE_F128LE = IEEE_F128BE.copy()
IEEE_F128LE.set_order(H5T_ORDER_LE)
IEEE_F128LE.lock()
LDOUBLE_LE = NATIVE_LDOUBLE.copy()
LDOUBLE_LE.set_order(H5T_ORDER_LE)
LDOUBLE_LE.lock()
LDOUBLE_BE = NATIVE_LDOUBLE.copy()
LDOUBLE_BE.set_order(H5T_ORDER_BE)
LDOUBLE_BE.lock()
# Custom Python object pointer type
cdef hid_t H5PY_OBJ = H5Tcreate(H5T_OPAQUE, sizeof(PyObject*))
H5Tset_tag(H5PY_OBJ, "PYTHON:OBJECT")
H5Tlock(H5PY_OBJ)
PYTHON_OBJECT = lockid(H5PY_OBJ)

h5py/h5py/h5t.pyx

Lines 1402 to 1408 in c585944

cdef dict _int_le = {1: H5Tcopy(H5T_STD_I8LE), 2: H5Tcopy(H5T_STD_I16LE), 4: H5Tcopy(H5T_STD_I32LE), 8: H5Tcopy(H5T_STD_I64LE)}
cdef dict _int_be = {1: H5Tcopy(H5T_STD_I8BE), 2: H5Tcopy(H5T_STD_I16BE), 4: H5Tcopy(H5T_STD_I32BE), 8: H5Tcopy(H5T_STD_I64BE)}
cdef dict _int_nt = {1: H5Tcopy(H5T_NATIVE_INT8), 2: H5Tcopy(H5T_NATIVE_INT16), 4: H5Tcopy(H5T_NATIVE_INT32), 8: H5Tcopy(H5T_NATIVE_INT64)}
cdef dict _uint_le = {1: H5Tcopy(H5T_STD_U8LE), 2: H5Tcopy(H5T_STD_U16LE), 4: H5Tcopy(H5T_STD_U32LE), 8: H5Tcopy(H5T_STD_U64LE)}
cdef dict _uint_be = {1: H5Tcopy(H5T_STD_U8BE), 2: H5Tcopy(H5T_STD_U16BE), 4: H5Tcopy(H5T_STD_U32BE), 8: H5Tcopy(H5T_STD_U64BE)}
cdef dict _uint_nt = {1: H5Tcopy(H5T_NATIVE_UINT8), 2: H5Tcopy(H5T_NATIVE_UINT16), 4: H5Tcopy(H5T_NATIVE_UINT32), 8: H5Tcopy(H5T_NATIVE_UINT64)}

Even importlib.reload(h5py.h5t) is not enough, due to cython/cython#2659.
The following code allowed me to pass all tests:

    def tearDown(self):
        h5.close()
        h5.open()
        h5py.h5t.init()  # new function
        importlib.reload(h5py._hl.base)
        h5py._hl.base.dlapl = h5py._hl.base.default_lapl()
        h5py._hl.base.dlcpl = h5py._hl.base.default_lcpl()
        importlib.reload(h5py)

but I'm now not sure it's as appealing as it initially seemed :-)

Do we want to make targeted changes to allow re-initialization of h5py after it's loaded? Might that give some additional benefits?

@takluyver (Member)

That doesn't sound like a promising approach, unfortunately. I don't think tests should rely on a global 'reset button' that isn't used explicitly in normal code. It's easy for oversights to mean the reset button doesn't reset everything, and there may be corner cases when it messes up debugging. The guaranteed way to get a clean, isolated process environment is to start a new process.

Let's wait and see what the other maintainers (@tacaswell @aragilar) think of this. Maybe if it's only adding low-level API functions, it's OK for the tests not to run by default. Or perhaps they have some other idea about how to handle it.

@vasole (Contributor) commented Mar 12, 2019

@aparamon

I have been able to set the path after importing h5py, and I have been able to read the data. Thanks a lot!

Is it intentional that I have to specify the path as a byte string? It is a bit confusing that with HDF5_PLUGIN_PATH I can use a string, while with the function I have to use bytes.

It took me some time to figure out why append was not working: I had to get rid of the non-existent default directory, or use insert(path, 0), which also worked.

Tested on Windows, Python 3.6, HDF5 1.10.5 as supplied by the HDF Group.

import os
import h5py
# Clean up directories that might not exist:
# otherwise append does not work and I have to use insert(path, 0),
# which does work
delete = []
for i in range(h5py.h5pl.size()):
    dirname = h5py.h5pl.get(i)
    if not os.path.exists(dirname):
        print("directory <%s> to be deleted" % dirname)
        delete.append(i)
delete.reverse()
for i in delete:
    h5py.h5pl.remove(i)

fname = "lz4.h5"
assert(os.path.exists(fname))

for attempt in range(2):
    # first try to read without filter
    if attempt:
        h5py.h5pl.append(b"C:\\GIT\\hdf5plugin\\hdf5plugin\\VS2015\\x64")
        #h5py.h5pl.insert(b"C:\\GIT\\hdf5plugin\\hdf5plugin\\VS2015\\x64", 0)
        for i in range(h5py.h5pl.size()):
            print("i = ", i, "path = ", h5py.h5pl.get(i))    
    try:
        h5 = h5py.File(fname, "r")
        data = h5["/entry/data"][:]
        h5.close()
        expected_shape = (50, 2167, 2070)
        assert data.shape == expected_shape
        assert(data[21, 1911, 1549] == 3141)
        if attempt:
            print("SUCCESS!")
        else:
            print("Read without need of filter. Bad test")
    except Exception:
        if attempt == 0:
            print("Error expected. Great")
        else:
            print("Error NOT expected :(")


@aparamon (Member Author)

@vasole Thanks for the review!
I was hesitant to decode HDF5 byte strings because I was not sure about the encoding. Maybe that's the right thing to do in some higher-level interface (based on MutableSequence), but for now I propose to explicitly encode()/decode() strings for use with the h5pl functions.
HDF5_PLUGIN_PATH is an environment variable, which is already a decoded string in Python.
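Until a string-accepting layer exists, the explicit round-trip might look like this. It uses os.fsencode/os.fsdecode, i.e. the filesystem encoding — a reasonable guess for search paths, though not something the HDF5 docs specify:

```python
import os

plugin_dir = "/opt/hdf5/vendor-plugin"

raw = os.fsencode(plugin_dir)   # str -> bytes, e.g. for h5pl.append(raw)
back = os.fsdecode(raw)         # bytes -> str, e.g. for printing h5pl.get(i)
```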

@takluyver (Member)

Ping @tacaswell @aragilar - I wanted your ideas about testing this before we merge it.

To summarise the discussion: it's tricky to test nicely, because the desired effect of these functions is global changes in the HDF5 library. Unless you can be sure you're undoing global changes properly, they can cause other tests to behave strangely, which is horrendous to debug. So I'd like to run these tests in their own subprocess so they don't interfere. But we haven't found a good way to do that.

So, do we:

  1. Add the functions without tests, or with tests that aren't run by default? It's only exposing some more functions from HDF5, so maybe we don't need to test them.
  2. Write normal tests, and trust that either the cleanup or the other tests are robust enough to cope with this.
  3. Hack our own machinery to run these tests in a subprocess and bring the results back to the main process (the current state of this PR).
  4. Use some package I'm not aware of to deal with running the tests in a separate process.

@aragilar (Member)

A variant of the first option might be a good middle ground: skip the plugin tests in the main test run, and have a separate test run for just the plugin tests (using something like https://docs.pytest.org/en/latest/example/simple.html#control-skipping-of-tests-according-to-command-line-option and an additional test env in tox). This won't stop the plugin tests interfering with each other, but at least it will localise the problem to the plugin system.
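The pattern from that docs page, adapted (the `plugin` marker and `--run-plugin` option names are placeholders):

```python
# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--run-plugin", action="store_true", default=False,
        help="run the HDF5 plugin-path tests",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-plugin"):
        return  # option given: run everything
    skip_plugin = pytest.mark.skip(reason="needs --run-plugin option to run")
    for item in items:
        if "plugin" in item.keywords:
            item.add_marker(skip_plugin)
```

A tox env would then run `pytest --run-plugin -m plugin` as a second pass.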

@scopatz (Member) commented Jun 4, 2019

It is a little hard to believe that we are still using unittest here

@scopatz (Member) commented Jun 4, 2019

Also, I have added an insubprocess() decorator for testing in #1224

@takluyver (Member)

Can I ask you to rebase this on master and try making use of the @insubprocess test decorator @scopatz added in #1224?

Also, I think that the new h5pl module should have a corresponding rst document in docs_api.

@takluyver (Member)

I've rebased and continued this in #1256

@takluyver (Member)

This was merged as #1256.


6 participants