PERF: Saving many datasets in a single group slows down with each new addition #58248

Closed
3 tasks done
Ieremie opened this issue Apr 13, 2024 · 11 comments · Fixed by #58275
Labels
Enhancement IO HDF5 read_hdf, HDFStore Performance Memory or execution speed performance

Comments

@Ieremie

Ieremie commented Apr 13, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

I found a strange behaviour that seems to appear only with PyTables (via pandas).

Saving many datasets within a single group becomes progressively slower.

import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

df = pd.DataFrame({'A': [1.0] * 1000})
df = pd.concat([df] * 13, axis=1, ignore_index=True)

size = 5000
timings = []
for i in tqdm.tqdm(range(0, size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))

    start = time.time()
    df.to_hdf('test.h5', key=key, mode='a', complevel=9)
    timings.append(time.time() - start)

plt.plot(timings[10:])
[plot: per-write timings grow with each new dataset]

This is not the case for h5py, which can easily write 100× as many datasets without slowing down.

import random
import string
import time

import h5py
import matplotlib.pyplot as plt
import tqdm

# df as defined in the example above
timings = []
size = 500000
with h5py.File('test2.h5', 'w', libver='latest') as hf:
    group = hf.create_group('group')
    for i in tqdm.tqdm(range(0, size), total=size):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))

        start = time.time()
        group.create_dataset(key, data=df.values, compression="gzip", compression_opts=9)
        timings.append(time.time() - start)

plt.plot(timings[10:])
[plot: per-write timings remain flat]

Installed Versions

Replace this line with the output of pd.show_versions()

Prior Performance

I have raised this issue in the PyTables repo, and it seems it is actually an issue with pandas: https://github.com/PyTables/PyTables/issues/1155

@Ieremie Ieremie added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Apr 13, 2024
@rhshadrach
Member

Thanks for the report!

Between the first and the last run I'm seeing 0.2ms and 34ms respectively spent in tables.file.File.create_group. This comes from this line:

group = self._handle.create_group(path, p)

But I have no familiarity with this code or pytables, so not sure if this is a pandas issue.
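For reference, a minimal sketch of how such a measurement can be reproduced with the standard-library profiler (the file and key names are placeholders):

import cProfile
import pstats

import pandas as pd

df = pd.DataFrame({'A': [1.0] * 1000})

# Profile a single to_hdf call against an already-large file and inspect
# the cumulative time spent in create_group.
with cProfile.Profile() as prof:
    df.to_hdf('test.h5', key='PROFILED', mode='a', complevel=9)
pstats.Stats(prof).sort_stats('cumtime').print_stats('create_group')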

@avalentino
Contributor

@rhshadrach the create_group method apparently has an execution time that depends on the number of nodes in the HDF5 file.
This is only true for newly opened files, because PyTables needs to rebuild the internal cache of all nodes. While this behaviour can be debated, the obvious workaround is to use an HDFStore instead of the filename in the call to df.to_hdf (as described in PyTables/PyTables#1155 (comment)).
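The intended usage would look roughly like this (key names are placeholders; the sketch assumes to_hdf actually reuses the open store):

import pandas as pd

df = pd.DataFrame({'A': [1.0] * 1000})

# Open the store once so PyTables builds its node cache a single time,
# then pass the open store to every to_hdf call instead of a filename.
with pd.HDFStore('test.h5', mode='a', complevel=9) as store:
    for i in range(5000):
        df.to_hdf(store, key=f'entry_{i}')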

Unfortunately this doesn't work, because apparently the HDFStore is converted back to a string ("stringified") in

path_or_buf = stringify_path(path_or_buf)
if isinstance(path_or_buf, str):
    with HDFStore(
        path_or_buf, mode=mode, complevel=complevel, complib=complib
    ) as store:
        f(store)
else:
    f(path_or_buf)

From my point of view this is unexpected and I would be curious to understand if it is really necessary.

@Ieremie
Author

Ieremie commented Apr 15, 2024

@rhshadrach I've tested this with HDFStore, and the slowdown doesn't seem to be due to opening the file for every write.

import random
import string
import time

import matplotlib.pyplot as plt
import pandas as pd
import tqdm

# df as defined in the first example above
size = 50000

timings = []
for i in tqdm.tqdm(range(0, size), total=size):
    key = ''.join(random.choices(string.ascii_uppercase, k=20))

    start = time.time()
    with pd.HDFStore("test2.h5", 'w') as store:
        store.put(f'entry/{key}', df)
    timings.append(time.time() - start)

plt.plot(timings[10:])

[plot of per-write timings]

@avalentino
Contributor

avalentino commented Apr 15, 2024

@rhshadrach in your example you are creating a new HDF5 file at each iteration, so the file always has at most one group.
This introduces only a small constant delay.
Moreover, I would not expect issues with the HDFStore.put method in any case, because it does not re-open the file.
The problem appears when you move the HDFStore instantiation out of the loop and then use the df.to_hdf method.
In that case the number of groups in the HDF5 file keeps increasing and, due to the behaviour that I highlighted above, a new HDFStore is created at each to_hdf call even if one is passed as input.

path_or_buf = stringify_path(path_or_buf)
if isinstance(path_or_buf, str):
    with HDFStore(
        path_or_buf, mode=mode, complevel=complevel, complib=complib
    ) as store:
        f(store)
else:
    f(path_or_buf)

In this case you can observe the linear slowdown.

@Ieremie
Author

Ieremie commented Apr 15, 2024

@avalentino right, my mistake. Opening with pd.HDFStore("test2.h5", 'a') as store: makes things slow again.

@avalentino
Contributor

Sorry, I have just realised that I did not tag the correct person in my previous post.

@Ieremie
Author

Ieremie commented Apr 15, 2024

@avalentino Just out of curiosity, how is the cache computed? I've noticed that splitting the datasets across multiple groups (each write still gets progressively slower) makes writing much faster, even with the file-reopening issue.
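For example, a hypothetical sharding scheme along those lines:

import random
import string
import zlib

import pandas as pd

df = pd.DataFrame({'A': [1.0] * 1000})
n_groups = 256  # hypothetical shard count

with pd.HDFStore('test2.h5', 'a') as store:
    for i in range(5000):
        key = ''.join(random.choices(string.ascii_uppercase, k=20))
        # Hash each key into one of n_groups groups so every group's node
        # list, and hence the cache rebuild, stays small.
        group_id = zlib.crc32(key.encode()) % n_groups
        store.put(f'group_{group_id}/{key}', df)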

@avalentino
Contributor

Basically PyTables loads the list of all names in an HDF5 group before making any change to it. This is of course done only once and cached. It is mostly done to support the "natural naming" feature.

The time for the operation apparently depends on the number of nodes in the group.

In principle "natural naming" for interactive shells is not a fundamental feature, but the machinery has been there since the very beginning, so touching it could require some work.
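For context, "natural naming" is PyTables' attribute-style access to child nodes, which is why the child names must be known up front; a small illustration (node names are placeholders):

import tables

with tables.open_file('test.h5', mode='r') as h5:
    # Attribute access resolves child nodes by name, so PyTables keeps a
    # cache of every child name in the group.
    node = h5.root.entry_0
    print(node)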

In any case, IMHO the way to go for your use case is to keep adding nodes without closing the file, so I do not see a compelling reason to start thinking about deep changes to PyTables.

Do you agree?

@Ieremie
Author

Ieremie commented Apr 15, 2024

@avalentino thanks for the input! Yes, that seems a reasonable solution. Hopefully pandas will update to_hdf to accept an HDFStore too.

@rhshadrach
Member

From a quick look, it does appear straightforward to expand to_hdf to accept an HDFStore instance. If the implementation is indeed easy, I'm +1 on the enhancement. If for some reason it turns out to be much more difficult than I expect, I'm not necessarily opposed, but it will have to be evaluated and weighed against the maintenance burden.

PRs to add are welcome!

@rhshadrach rhshadrach added Enhancement and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2024
@avalentino
Contributor

@rhshadrach from what I can tell, both the documentation and the implementation suggest that the to_hdf method was designed to accept an HDFStore as input.
I have created a PR (#58275) that implements a very simple change to avoid the systematic re-opening of the HDF5 file at each call to to_hdf.
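The change amounts to dispatching on the input type before stringifying, roughly along these lines (a sketch only; the exact diff is in #58275):

if isinstance(path_or_buf, HDFStore):
    f(path_or_buf)
else:
    path_or_buf = stringify_path(path_or_buf)
    with HDFStore(
        path_or_buf, mode=mode, complevel=complevel, complib=complib
    ) as store:
        f(store)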

avalentino added a commit to avalentino/pandas that referenced this issue Apr 16, 2024
mroeschke pushed a commit that referenced this issue Apr 16, 2024

* Avoid unnecessary re-opening of HDF5 files
* Update the whatsnew file
* Move the changelog entry for #58248 to the correct section
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this issue May 7, 2024
…pandas-dev#58275)