
PERF: optimize memory usage for to_hdf #9648

Merged — 1 commit merged into pandas-dev:master from jreback:pytables_memory on Mar 16, 2015

Conversation

@jreback (Contributor) commented Mar 13, 2015

from here

Reduce the memory usage needed when using to_hdf:

  • astype always made a copy, even when no conversion was needed
  • values were ravelled and then reshaped, creating extra intermediate copies
  • a new chunked buffer was constantly allocated; the same buffer is now re-used (see the sketch below)
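As a rough sketch of the copy-avoidance and buffer-reuse ideas (not the actual HDFStore internals; write_chunks and the write callback here are hypothetical, for illustration only):

import numpy as np

def write_chunks(values, chunksize, write):
    # astype with copy=False returns the original array when the dtype
    # already matches, instead of unconditionally allocating a copy.
    values = values.astype(np.float64, copy=False)

    # Allocate the chunk buffer once, outside the loop, and re-use it
    # on every iteration instead of allocating a fresh one per chunk.
    buf = np.empty((chunksize,) + values.shape[1:], dtype=values.dtype)
    for start in range(0, len(values), chunksize):
        chunk = values[start:start + chunksize]
        n = len(chunk)
        buf[:n] = chunk
        write(buf[:n])

# e.g. write_chunks(np.random.rand(10, 3), 4, lambda c: print(c.shape))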
In [1]: df = pd.DataFrame(np.random.rand(1000000,500))

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB

Previously

In [3]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 11029.49 MiB, increment: 7130.57 MiB

With PR

In [2]: %memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 4669.21 MiB, increment: 794.57 MiB
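For anyone reproducing these numbers, note that %memit is not a built-in IPython magic; it is provided by the memory_profiler package and has to be loaded first:

In [1]: %load_ext memory_profiler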

@jreback jreback added the Performance (memory or execution speed) and IO HDF5 (read_hdf, HDFStore) labels Mar 13, 2015
@jreback jreback added this to the 0.16.0 milestone Mar 13, 2015
@jreback jreback force-pushed the pytables_memory branch 6 times, most recently from d0f8583 to 21e727d on March 15, 2015 at 22:38
jreback added a commit that referenced this pull request Mar 16, 2015: PERF: optimize memory usage for to_hdf
@jreback jreback merged commit 269af25 into pandas-dev:master Mar 16, 2015
@bwillers (Contributor)

If you happen to find yourself in New York, I'm buying you a beer for this fix.

@jreback (Contributor, Author) commented Mar 28, 2015

Hahha, I figured since I broke it, I should fix it.

In NYC, so anytime!

@tomanizer

Thanks a lot for fixing this! It helps a lot.

@sagol commented Apr 22, 2019

The bug is back:

df = pd.DataFrame(np.random.rand(1000000,500))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 500 entries, 0 to 499
dtypes: float64(500)
memory usage: 3.7 GB

%memit -r 1 df.to_hdf('test.h5','df',format='table',mode='w')
peak memory: 7934.20 MiB, increment: 3823.80 MiB

pd.__version__
'0.24.2'

With a more complex structure, everything is much worse:

data_ifa.info()
<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, b88d3b87-3432-43cc-8219-f45d97389d8f to eb705297-94e8-4ccf-a910-5f3b9734d572
Data columns (total 2 columns):
bundles        100000 non-null object
bundles_len    100000 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.3+ MB

%memit -r 1 data_ifa.to_hdf(full_file_name_hd5, key='data_ifa', encoding='utf-8', complevel=9, mode='w', format='table')
peak memory: 22106.07 MiB, increment: 21324.53 MiB
