
Performance issue in h5py in multi-threaded environment #1516

Closed
kif opened this issue Mar 19, 2020 · 14 comments · Fixed by #1521

@kif (Contributor) commented Mar 19, 2020

In the scope of the development of BLISS (https://bliss.gitlab-pages.esrf.fr/bliss/master/) at ESRF,
we found a performance issue related to h5py in a multi-threaded server environment (the server becomes unresponsive for seconds).

Basically, the thread blocks when closing the file because the GIL is not released. After a quick profiling session, it comes down to h5py._hl.files (lines 400-422), and most of the time is spent in H5I_dec_ref, which does not release the GIL.
We tried releasing the GIL there and the tests pass without crashing. Do you know of any callback at the file-driver level that might be implemented in Python and cause GIL-related crashes?
If there isn't any, would you consider a PR that declares the functions from H5I as GIL-free?

Cheers,

Jerome

To assist reproducing bugs, please include the following:

  • Operating System: Linux, various flavours
  • Python version: 3.7 (Anaconda or system)
  • h5py version: 2.10, master
  • HDF5 version: 1.10.4, 1.10.5 (1.10.6?)
@kif kif self-assigned this Mar 19, 2020
@kif (Contributor, Author) commented Mar 19, 2020

The test code is as simple as this:

import numpy, h5py, time
print(f"h5py: {h5py.version.version} hdf5: {h5py.version.hdf5_version}")
SIZE = 1000000000
data = numpy.arange(SIZE)
print(f"Working with dataset of size: {data.nbytes>>20} MB")
t0 = time.time()
f = h5py.File("/tmp/test.h5", "w")
f["data"] = data
f.flush()
f.close()
print(f"Written at {data.nbytes/(time.time()-t0)/(1<<20)} MB/s")

The second-to-last line (h5py.File.close) takes ~50 s on a spinning disk without releasing the GIL.

@kif (Contributor, Author) commented Mar 20, 2020

This snippet of code demonstrates that the GIL is not released while the file is being closed, which can take quite a while, as in this example:

from time import sleep
from threading import Thread
import h5py
import numpy

def background_task():
    i = 1
    while not background_task.cancelled:
        print(f"heartbeat {i}")
        i += 1
        sleep(1)

background_task.cancelled = False
heartbeat = Thread(target=background_task)
heartbeat.start()

try:
    data = numpy.arange(1000000000)
    for i in range(2):
        print(f"open h5py.File {i}")
        with h5py.File("/tmp/test.h5", "w") as f:
            print("create dataset ...")
            f["data"] = data
            print("big flush ...")
            f.flush()
            print("big flush done")
        print(f"closed h5py.File {i}")
finally:
    background_task.cancelled = True
    heartbeat.join()

@takluyver (Member) commented

And it's really closing the file that hangs, not the f.flush() call? I would have guessed that if you explicitly flush first, that would be slow and then close would be (relatively) fast.

In any case, I can't think of any Python callbacks that would fire on close except when you use the fileobj driver, and that's already marked to reacquire the GIL if needed.
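To separate the two cases, a minimal sketch (my own, not from the thread; file path is made up) that times each phase of the original reproducer independently, so you can see whether the flush or the close dominates:

```python
import time

import h5py
import numpy

# Smaller than the original reproducer so it runs quickly; scale SIZE up
# to reproduce the multi-second close on a spinning disk.
SIZE = 10_000_000
data = numpy.arange(SIZE)

t0 = time.perf_counter()
f = h5py.File("/tmp/test_phases.h5", "w")
f["data"] = data
t1 = time.perf_counter()
f.flush()
t2 = time.perf_counter()
f.close()  # the call under suspicion: H5I_dec_ref runs here
t3 = time.perf_counter()
print(f"write: {t1 - t0:.2f}s  flush: {t2 - t1:.2f}s  close: {t3 - t2:.2f}s")
```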

@kif (Contributor, Author) commented Mar 23, 2020

Hi Thomas,

We double- (even triple-) checked: it is not the flush that hangs but the call to h5i.dec_ref (called from close() in h5py._hl.files).
This issue arose ~3 weeks ago in the development of BLISS:
https://gitlab.esrf.fr/bliss/bliss/issues/1431

The fileobj driver has been investigated, and it helps a bit: the code is not slower, but then the time is spent in fileobj.close(), which does not release the GIL either.

I believe it is better (and simpler) to stay with the POSIX file driver from the HDF Group (not that I don't trust the fileobj driver). We also found that the location of the bottleneck depends on the speed of the drive: from actual spinning disks (~100-200 MB/s) to fast NVMe (2500 MB/s), the bottleneck is in the file close. With ramdisks, the bottleneck moves to the dataset creation. I have not investigated networked filesystems yet.

With this patch I get 87 heartbeats instead of 5 to 12 without it, for a total execution time of 86.2 s (i.e. no performance loss).

Cheers,

@takluyver (Member) commented

Yup, sorry, I wasn't suggesting that the fileobj driver would help, just that that's the only case I can think of where it would trigger a Python callback, and that's already handled.

@kif (Contributor, Author) commented Mar 23, 2020

I was also afraid of that, and I checked that it does not seg-fault.

@ax3l (Contributor) commented Apr 25, 2020

Sorry for hijacking this thread. It would be super awesome if we could release the GIL during the execution of filters such as Blosc, which supports multi-threaded compressors: Blosc/hdf5-blosc#25
Is that possible?

@takluyver (Member) commented

AIUI, HDF5 will call the filters during a read/write call. In h5py master, these (H5Dread & H5Dwrite) now release the GIL. So if the filter is implemented in C code with no Python involved, the GIL should be released while it runs. If there's Python in that, it will need to reacquire the GIL to run Python code, and then release it again.

But I think this is orthogonal to whether the filter can use multiple threads internally. The GIL only restricts running Python code; multiple threads can still run concurrently even while the GIL is held, so long as they're not running Python code.
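As a small illustration of the related point (my own sketch, using zlib rather than an HDF5 filter): a long-running C call that releases the GIL does not stop a pure-Python heartbeat thread from making progress:

```python
import os
import threading
import time
import zlib

ticks = []
stop = threading.Event()

def heartbeat():
    # Pure-Python loop: it can only make progress when the GIL is free.
    while not stop.is_set():
        ticks.append(time.perf_counter())
        time.sleep(0.05)

t = threading.Thread(target=heartbeat)
t.start()
time.sleep(0.2)                       # let the heartbeat get going
before = len(ticks)

# zlib.compress is C code that releases the GIL for large buffers,
# so the heartbeat keeps ticking while it runs.
payload = os.urandom(64 << 20)        # 64 MB of incompressible data
zlib.compress(payload, 9)
during = len(ticks) - before

stop.set()
t.join()
print(f"heartbeats during compression: {during}")
```

A C call that holds the GIL for its whole duration (like the H5I_dec_ref case above) would leave `during` at or near zero.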

@ax3l (Contributor) commented Apr 26, 2020

Thank you for your feedback!

I have now compiled from master and can definitely say that the c-blosc filter still does not spawn threads. The native python-blosc bindings, in turn, do use threads, both with and without releasing the GIL.

The GIL only restricts running Python code; multiple threads can still run concurrently if the GIL is held, so long as they're not running Python code.

I know that you have tons of experience with CPython, but are you sure this statement is true even for functions (e.g. the filter) called from Python into a C hook? I just read this and it seems to suggest otherwise.

You are probably right, and it is quite likely caused by something else. Either way, the HDF5 c-blosc filter in h5py is definitely not using its excellent threading features, as far as I can test.

@ax3l (Contributor) commented Apr 26, 2020

I added an extended demonstrator here: Blosc/hdf5-blosc#25

@aldanor commented Jan 19, 2021

@ax3l Any news on hdf5-blosc front? Have you figured out what's blocking the GIL?

@vasole (Contributor) commented Jan 19, 2021

As far as I understand, HDF5 serializes the compression, hence the need for things like direct chunk writing.

See page 6 on https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/UsingDirectChunkWrite.pdf

Unless the blosc plugin can spawn threads for compressing a single chunk, you have a plausible explanation.

@ax3l (Contributor) commented Jan 20, 2021

Unless the blosc plugin can spawn threads for compressing a single chunk, you have a plausible explanation.

If I am not mistaken, I think that's exactly what blosc does.

@takluyver (Member) commented

On Blosc/hdf5-blosc#25, it looks like the issue you're describing is that the Blosc integration doesn't have a way to tell it to use multiple threads when compressing data for HDF5. Unless you've got a reason for thinking otherwise, I'd say that's an entirely separate matter from the GIL. I don't think the GIL should matter at all to Blosc, because I imagine its threads don't need to run Python code.

I took a quick dive into the Blosc source code, and spotted this comment

Run the serial version when nthreads is 1 or when the buffers are not much larger than blocksize

So even if you set the BLOSC_NTHREADS environment variable to allow it to run in parallel, it will ignore you if the chunks HDF5 asks it to compress are small enough that it won't benefit much from parallelising. You don't specify a chunk size in create_dataset(), so h5py will guess it for you. The logic for both h5py's chunk size and Blosc's block size is somewhat complicated, but I think there's a good chance you're hitting this case.

Your workaround of calling Blosc separately and then storing the result should compress an entire image at once, which should make for faster writing and better compression - at a cost of making it slower to read a small part of the image. You can achieve this by setting the chunk size to the size of the image: chunks=arr.shape.
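As a sketch of that suggestion (using h5py's built-in gzip filter rather than the Blosc plugin, so it runs without extra plugins; the file path and array are made up):

```python
import h5py
import numpy

arr = numpy.random.rand(512, 512)   # stand-in for one image
with h5py.File("/tmp/one_chunk_per_image.h5", "w") as f:
    # chunks=arr.shape: the whole image is one chunk, so the filter
    # compresses it in a single call instead of many small ones.
    dset = f.create_dataset("img", data=arr, chunks=arr.shape,
                            compression="gzip")
    print(dset.chunks)   # (512, 512)

with h5py.File("/tmp/one_chunk_per_image.h5") as f:
    assert (f["img"][...] == arr).all()   # lossless round-trip
```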

A couple of unrelated notes on the example:

  • If you want to store a series of images of uniform size, it may be neater to do that in a single 3D dataset rather than many 2D datasets. E.g. this makes it easy to find how many images there are.
  • If you do need to compress data separately and then write it to HDF5, the low-level write_direct_chunk method lets you do that while still making a meaningful dataset (real shape & dtype, can be read with the appropriate filter). This is a bit more fiddly than the high-level API, though.
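To make the second point concrete, a sketch of the write_direct_chunk route with the gzip filter (HDF5's deflate filter stores a plain zlib stream, so zlib.compress produces bytes it can read back); the dataset names and path are my own:

```python
import zlib

import h5py
import numpy

data = numpy.arange(1024, dtype="int64")
with h5py.File("/tmp/direct_chunk.h5", "w") as f:
    # A chunked, gzip-compressed dataset; one chunk covering the whole shape.
    dset = f.create_dataset("data", shape=data.shape, dtype=data.dtype,
                            chunks=data.shape, compression="gzip")
    # Compress ourselves and bypass HDF5's filter pipeline on write.
    dset.id.write_direct_chunk((0,), zlib.compress(data.tobytes(), 4))

with h5py.File("/tmp/direct_chunk.h5") as f:
    # Normal reads go through the deflate filter and see the real data.
    assert (f["data"][...] == data).all()
```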
