DataFrame Memory Usage after HDF5 retrieve #6379

@CarstVaartjes

Hi,

Pandas seems to use far too much memory after retrieving data from an HDF5 store. Memory peaks while reading (at up to 30 times the size of the final data) and then, even after calling the garbage collector, the process still uses around 10x the size of the data.
If I then extract the numpy arrays, delete the dataframe and create a new one from those arrays, the new dataframe is memory efficient and uses what you would expect.
Please find an example below; the memory figures in the comments are what "top" reports for the ipython process.
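
For anyone who wants to reproduce the measurements without watching "top", here is a minimal sketch that reads the process memory from inside Python. It is not what I used for the numbers in this report; the helper names peak_rss_mb and current_rss_mb are made up for illustration, and the /proc lookup assumes Linux.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def current_rss_mb():
    """Return the current resident set size in MiB, or None if unavailable."""
    try:
        # Linux only: the VmRSS line holds the current RSS in kB.
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0
    except IOError:
        pass
    return None


print("peak RSS: %.0f MiB" % peak_rss_mb())
print("current RSS: %s MiB" % current_rss_mb())
```

Calling the two helpers before and after each step of the test session should give figures comparable to the ones quoted in the comments, although "top" may round differently.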

Create test data first:

# Create test data: 5 million rows, 26 float columns named 'a' to 'z'
import pandas as pd
import numpy as np
test_df = pd.DataFrame(np.random.randn(5000000, 26), columns=list('abcdefghijklmnopqrstuvwxyz'))
test_df[['a', 'b']] = test_df[['a', 'b']].astype(int)

with pd.get_store(path='test.h5', mode='w', complib='blosc', complevel=9) as store_hdf:
    store_hdf.append('store_0', test_df, data_columns=['a', 'b'])

Now exit (i)python, start a fresh session and run the test:

import pandas as pd
import numpy as np
import gc
# 44MB

# Path to the test.h5 file created above
path_name = '/srv/www/li/test.h5'
with pd.get_store(path_name, mode='r') as fact_hdf:
    test_df = fact_hdf.select('store_0', columns=['a', 'c'])

# peaks at 3.6GB, ends at 3.1GB (drops to 2.1GB a little later)

print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB

gc.collect()
# 1.2GB

a = test_df['a'].values
c = test_df['c'].values
# No impact on memory usage

del test_df
gc.collect()
# drops to 154MB

test_df = pd.DataFrame({'a': a, 'c': c})
# total memory usage goes up to 268MB
print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB

# now delete the numpy arrays
del a
del c
gc.collect()
# 229MB

del test_df
gc.collect()
# 115MB
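
As a rough sanity check on the figures above, the expected footprint of the selected data can be worked out by hand. This sketch assumes 8 bytes per value and per index entry (which matches the nbytes output above) and treats the "GB" shown by "top" as GiB; the variable names are only for illustration.

```python
# Assumption: 8 bytes per value (float64 / int64), consistent with the nbytes output above.
ROWS = 5000000
BYTES_PER_VALUE = 8
MIB = 1024 * 1024

values_bytes = ROWS * 2 * BYTES_PER_VALUE  # two selected columns: 'a' and 'c'
index_bytes = ROWS * BYTES_PER_VALUE       # one 8-byte index entry per row
expected_mib = (values_bytes + index_bytes) / float(MIB)

# Figures read from "top" in the session above; "GB" is taken to mean GiB here.
observed_mib = {
    "peak while reading": 3.6 * 1024,
    "after gc.collect()": 1.2 * 1024,
}

print("values: %d MiB, index: %d MiB" % (values_bytes // MIB, index_bytes // MIB))
for label, mib in sorted(observed_mib.items()):
    print("%s: %.1fx the expected %.0f MiB" % (label, mib / expected_mib, expected_mib))
```

The values and index come out at 76 MiB and 38 MiB, matching what pandas reports, so the extra memory is not accounted for by the data in the dataframe itself.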

Machine Info:

  • Ubuntu 12.04 LTS 64-bit
  • Pandas 0.13.1
  • PyTables 3.1.0

Any ideas on what is happening?
