Hi,
Pandas seems to be using far too much memory after retrieving data from an HDF5 store. Memory peaks while reading (at up to 30 times the size of the final data) and afterwards, even after calling the garbage collector, the process still holds around 10x the size of the data.
If I then pull the numpy arrays out of the dataframe, delete the dataframe and create a new one from those arrays, the new dataframe is memory efficient and uses roughly what you would expect.
Please find an example below; the memory figures are what I see in "top" for the ipython process.
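For convenience, here is a rough way to check memory from inside the session as well, using the standard-library resource module (just a sketch; the numbers reported below were read from "top", not from this snippet):
import resource

def peak_rss_mb():
    # ru_maxrss is the peak resident set size; on Linux it is reported in kilobytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print peak_rss_mb()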
Create test data first:
# Create test data
import pandas as pd
import numpy as np
test_df = pd.DataFrame(np.random.randn(5000000, 26), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])
test_df[['a', 'b']] = test_df[['a', 'b']].astype(int)
with pd.get_store(path='test.h5', mode='w', complib='blosc', complevel=9) as store_hdf:
    store_hdf.append('store_0', test_df, data_columns=['a', 'b'])
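Back-of-the-envelope (assuming 8 bytes per int64/float64 value), the full frame should take roughly 1GB in memory, while the two columns selected in the test below plus the index come to only about 76MB + 38MB:
# rough expected sizes, 8 bytes per value
print 5000000 * 26 * 8 / (1024 * 1024)  # full frame: ~992MB
print 5000000 * 2 * 8 / (1024 * 1024)   # columns 'a' and 'c': ~76MB
print 5000000 * 8 / (1024 * 1024)       # index: ~38MB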
Now exit (i)python, start a fresh session, and run the test:
import pandas as pd
import numpy as np
import gc
# fresh session, after imports: 44MB
path_name = '/srv/www/li/test.h5'
with pd.get_store(path_name, mode='r') as fact_hdf:
    test_df = fact_hdf.select('store_0', columns=['a', 'c'])
# peaks at 3.6GB, ends at 3.1GB (drops to 2.1GB a little later)
print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB
gc.collect()
# 1.2GB after gc.collect()
a = test_df['a'].values
c = test_df['c'].values
# No impact on memory usage
del test_df
gc.collect()
# drops to 154MB
test_df = pd.DataFrame({'a': a, 'c': c})
# total memory usage goes up to 268MB
print test_df.values.nbytes / (1024*1024)
# 76MB
print test_df.index.nbytes / (1024*1024)
# 38MB
# now delete the numpy arrays
del a
del c
gc.collect()
# 229MB
del test_df
gc.collect()
# 115MB
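For reference, the workaround above wrapped into a small helper (just a sketch; read_lean is my own name, not a pandas function):
import gc
import pandas as pd

def read_lean(path, key, columns):
    # select via HDFStore, keep only the raw column arrays,
    # then drop the intermediate frame and rebuild a plain DataFrame
    with pd.get_store(path, mode='r') as store:
        tmp = store.select(key, columns=columns)
    arrays = dict((col, tmp[col].values) for col in columns)
    del tmp
    gc.collect()
    return pd.DataFrame(arrays)

lean_df = read_lean('/srv/www/li/test.h5', 'store_0', ['a', 'c'])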
Machine Info:
- Ubuntu 12.04 LTS, 64-bit
- pandas 0.13.1
- PyTables 3.1.0
Any ideas on what is happening?