Hi,
Pandas seems to be using far too much memory after retrieving data from an HDF5 store. Memory peaks while reading (at up to 30 times the size of the final data) and afterwards, even after calling the garbage collector, the process still holds around 10x the size of the data.
If I then pull the numpy arrays out of the dataframe, delete the dataframe and create a new one from those arrays, the new dataframe is memory efficient and uses roughly what you would expect.
Please find an example below; the memory figures are what I see in "top" for the ipython process.
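For convenience, here is a rough way to check memory from inside the session as well, using the standard-library resource module (just a sketch; the numbers reported below were read from "top", not from this snippet):
import resource

def peak_rss_mb():
    # ru_maxrss is the peak resident set size; on Linux it is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print peak_rss_mb()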
Create test data first:
# Create test data
import pandas as pd
import numpy as np
test_df = pd.DataFrame(np.random.randn(5000000, 26), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])
test_df[['a', 'b']] = test_df[['a', 'b']].astype(int)
with pd.get_store(path='test.h5', mode='w', complib='blosc', complevel=9) as store_hdf:
    store_hdf.append('store_0', test_df, data_columns=['a', 'b'])
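Back-of-the-envelope (assuming 8 bytes per int64/float64 value), the full frame should take roughly 1GB in memory, while the two columns selected in the test below plus the index come to only about 76MB + 38MB:
# rough expected sizes, 8 bytes per value
print 5000000 * 26 * 8 / (1024 * 1024)  # full frame: ~992MB
print 5000000 * 2 * 8 / (1024 * 1024)   # columns 'a' and 'c': ~76MB
print 5000000 * 8 / (1024 * 1024)       # index: ~38MB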
Now exit (i)python, start a fresh session, and run the test:
import pandas as pd
import numpy as np
import gc
# fresh session, after imports: 44MB
path_name = '/srv/www/li/test.h5'
with pd.get_store(path_name, mode='r') as fact_hdf:
    test_df = fact_hdf.select('store_0', columns=['a', 'c'])
# peaks at 3.6GB, ends at 3.1GB (drops to 2.1GB a little later)
print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB
gc.collect()
# 1.2GB after gc.collect()
a = test_df['a'].values
c = test_df['c'].values
# No impact on memory usage
del test_df
gc.collect()
# drops to 154MB
test_df = pd.DataFrame({'a': a, 'c': c})
# total memory usage goes up to 268MB
print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB
# now delete the numpy arrays
del a
del c
gc.collect()
# 229MB
del test_df
gc.collect()
# 115MB
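For reference, the workaround above wrapped into a small helper (just a sketch; read_lean is my own name, not a pandas function):
import gc
import pandas as pd

def read_lean(path, key, columns):
    # select via HDFStore, keep only the raw column arrays,
    # then drop the intermediate frame and rebuild a plain DataFrame
    with pd.get_store(path, mode='r') as store:
        tmp = store.select(key, columns=columns)
    arrays = dict((col, tmp[col].values) for col in columns)
    del tmp
    gc.collect()
    return pd.DataFrame(arrays)

lean_df = read_lean('/srv/www/li/test.h5', 'store_0', ['a', 'c'])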
Machine Info:
- Ubuntu 12.04 LTS, 64-bit
- pandas 0.13.1
- PyTables 3.1.0
Any ideas on what is happening?