Performance issue in h5py in multi-threaded environment #1516
The test code is as simple as that:

The last-but-one line (`h5py.File.close`) takes ~50 s on a spinning disk without releasing the GIL.
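The original snippet is not preserved in this thread; a minimal reconstruction (hypothetical file path, dataset shape, and sizes — not the reporter's actual code) might look like this, assuming h5py and numpy are installed:

```python
import time

elapsed = None
try:
    import h5py
    import numpy as np
except ImportError:
    h5py = None  # let the sketch degrade gracefully without h5py

if h5py is not None:
    f = h5py.File("/tmp/gil_repro.h5", "w")  # hypothetical path
    dset = f.create_dataset("data", (256, 256, 64), dtype="f4",
                            chunks=(64, 64, 64))
    dset[...] = np.random.rand(256, 256, 64).astype("f4")
    t0 = time.monotonic()
    # close() -> h5i.dec_ref -> H5I_dec_ref: flushes buffered data to
    # disk while holding the GIL, blocking every other Python thread.
    f.close()
    elapsed = time.monotonic() - t0
    print(f"close took {elapsed:.2f}s")
```

On a slow spinning disk the final `close()` is where the tens of seconds go, with the GIL held for the whole duration.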
This snippet of code demonstrates that the GIL is not released while the file is being closed, which can take a while, as in this example.
And it's really closing the file that hangs, not the flush? In any case, I can't think of any Python callbacks that would fire on close except when you use the file-like (`fileobj`) driver.
Hi Thomas,

We double- (even triple-) checked: it is not the flush that is hanging but the call to `h5i.dec_ref` (called from `close()` in `h5py._hl.files`). I believe it is better (and simpler) to stay with the POSIX file driver from the HDF Group (not that I don't trust the `fileobj` driver).

We also found that the position of the bottleneck is related to the speed of the drive: from actual spinning disks (~100-200 MB/s) to fast NVMe (2500 MB/s), the bottleneck is there. With ramdisks, the bottleneck moves to the dataset creation. I did not investigate networked filesystems for now.

With this patch I get 87 heartbeats instead of 5 to 12 without, for a total execution time of 86.2 s (i.e. no loss).

Cheers,
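The heartbeat measurement described above can be reproduced without h5py. This sketch counts ticks of a background thread during a blocking call; catastrophic regex backtracking stands in for a C call that holds the GIL (like `H5I_dec_ref` here), and `time.sleep` stands in for one that releases it:

```python
import re
import threading
import time

def count_heartbeats(work, interval=0.01):
    """Run work() in the main thread while a daemon thread ticks a
    counter every `interval` seconds; return the ticks observed."""
    ticks = 0
    stop = threading.Event()

    def heartbeat():
        nonlocal ticks
        while not stop.is_set():
            ticks += 1
            time.sleep(interval)

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    work()
    stop.set()
    t.join()
    return ticks

# A single C-level call that holds the GIL for its whole duration
# (exponential regex backtracking never reaches a bytecode boundary,
# so the heartbeat thread starves):
held = count_heartbeats(lambda: re.match(r"(a+)+$", "a" * 27 + "b"))

# A C-level call that releases the GIL while it blocks:
released = count_heartbeats(lambda: time.sleep(1.0))

print(f"GIL held: {held} heartbeats; GIL released: {released} heartbeats")
```

Few-to-no heartbeats during the first call versus dozens during the second is exactly the symptom reported for the unpatched `close()`.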
Yup, sorry, I wasn't suggesting that the …
I was also afraid of that, and checked that it did not seg-fault.
Sorry for hijacking this thread. It would be super awesome if we could release the GIL during the execution of filters such as Blosc, which supports multi-threaded compression: Blosc/hdf5-blosc#25
AIUI, HDF5 will call the filters during a read/write call. In h5py master, these calls release the GIL. But I think this is orthogonal to whether the filter can use multiple threads internally. The GIL only restricts running Python code; multiple threads can still run concurrently even while one of them holds the GIL, so long as the others aren't running Python code.
Thank you for your feedback! I now compiled from
I know that you have tons of experience with CPython, but are you sure this statement is true even for functions (e.g. the filter) called from Python into a C hook? I just read this and it seems to suggest otherwise. You are probably right and it's quite likely caused by something else. Either way, the c-blosc filter for HDF5 in h5py is definitely not using its excellent threading features, as far as I can tell.
I added an extended demonstrator here: Blosc/hdf5-blosc#25 |
@ax3l Any news on hdf5-blosc front? Have you figured out what's blocking the GIL? |
As far as I have understood, HDF5 serializes the compression, hence the need for things like direct chunk writing. See page 6 of https://support.hdfgroup.org/HDF5/doc/Advanced/DirectChunkWrite/UsingDirectChunkWrite.pdf. Unless the Blosc plugin can spawn threads for compressing a single chunk, you have a plausible explanation.
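The direct chunk write mentioned above is exposed in h5py through the low-level API (`DatasetID.write_direct_chunk`). A hedged sketch, using `zlib` as a stand-in compressor (HDF5's deflate filter stores a plain zlib stream, so the bytes round-trip; a threaded compressor like Blosc would take its place in the real use case), with hypothetical file path and shapes:

```python
import zlib

roundtrip = None
try:
    import h5py
    import numpy as np
except ImportError:
    h5py = None  # let the sketch degrade gracefully without h5py

if h5py is not None:
    chunk = np.arange(64 * 64, dtype="f4").reshape(64, 64)
    with h5py.File("/tmp/direct_chunk.h5", "w") as f:
        dset = f.create_dataset("img", (64, 64), dtype="f4",
                                chunks=(64, 64), compression="gzip")
        # Compress the whole chunk ourselves and hand the raw bytes
        # straight to HDF5, bypassing its single-threaded filter
        # pipeline entirely:
        dset.id.write_direct_chunk((0, 0),
                                   zlib.compress(chunk.tobytes()))
    with h5py.File("/tmp/direct_chunk.h5", "r") as f:
        # A normal read goes back through HDF5's deflate filter:
        roundtrip = f["img"][...]
```

Because the compression happens outside HDF5, nothing stops the external compressor from using as many threads as it likes.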
If I am not mistaken, I think that's exactly what blosc does. |
On Blosc/hdf5-blosc#25, it looks like the issue you're describing is that the Blosc integration doesn't have a way to tell it to use multiple threads when compressing data for HDF5. Unless you've got a reason for thinking otherwise, I'd say that's an entirely separate matter from the GIL. I don't think the GIL should matter at all to Blosc, because I imagine its threads don't need to run Python code. I took a quick dive into the Blosc source code, and spotted this comment
So even if you set the BLOSC_NTHREADS environment variable to allow it to run in parallel, it will ignore you if the chunks HDF5 asks it to compress are small enough that parallelising wouldn't help much. You don't specify a chunk size in

Your workaround of calling Blosc separately and then storing the result compresses an entire image at once, which should make for faster writing and better compression, at the cost of making it slower to read a small part of the image. You can achieve this by setting the chunk size to the size of the image:

A couple of unrelated notes on the example:
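The one-chunk-per-image layout suggested above can be sketched as follows (hypothetical frame size and file path; assumes h5py and numpy are available):

```python
chunk_shape = None
try:
    import h5py
    import numpy as np
except ImportError:
    h5py = None  # let the sketch degrade gracefully without h5py

if h5py is not None:
    image = np.zeros((512, 512), dtype="u2")  # hypothetical frame
    with h5py.File("/tmp/one_chunk.h5", "w") as f:
        # One chunk per image: each frame is compressed as a single
        # block (faster writes and better ratios, but reading even a
        # small slice decompresses a whole frame).
        dset = f.create_dataset("frames", shape=(10,) + image.shape,
                                dtype="u2", chunks=(1,) + image.shape,
                                compression="gzip")
        dset[0] = image
        chunk_shape = dset.chunks
```

Larger chunks also get Blosc past its "too small to parallelise" threshold, for whatever that is worth once its thread count can actually be configured.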
In the scope of the development of BLISS (https://bliss.gitlab-pages.esrf.fr/bliss/master/) at the ESRF,
we found a performance issue related to h5py in a multi-threaded server environment (the server becomes unresponsive for seconds).
Basically, the thread blocks when closing the file, without releasing the GIL. A quick profiling traced it down to h5py._hl.files (lines 400-422); most of the time is spent in H5I_dec_ref, which does not release the GIL.
We tried releasing the GIL there, and the test passes without crashing. Do you know of any callback at the file-driver level which might be implemented in Python and cause GIL-related crashes?
If there isn't any, would you consider a PR which declares the functions from H5I as GIL-free?
Cheers,
Jerome
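The proposed change could be sketched in Cython roughly as below. This is an illustration only, not h5py's actual wrapper code (the real bindings declare `hid_t` and wrap `H5Idec_ref` through their own generated API); the point is simply the `nogil` declaration plus the `with nogil` block:

```
cdef extern from "hdf5.h":
    ctypedef long long hid_t          # sketch; real typedef varies by HDF5 version
    int H5Idec_ref(hid_t obj_id) nogil

def dec_ref(hid_t obj_id):
    cdef int ret
    with nogil:
        # The flush/close work inside HDF5 now runs without the GIL,
        # so other Python threads keep running during a slow close.
        ret = H5Idec_ref(obj_id)
    if ret < 0:
        raise RuntimeError("H5Idec_ref failed")
    return ret
```

This is safe only as long as HDF5 never calls back into Python from inside `H5I_dec_ref`, which is exactly the question raised above about file-driver callbacks.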