
Optionally use sys.getsizeof in DataFrame.memory_usage #11595

Closed
mrocklin opened this issue Nov 13, 2015 · 14 comments · Fixed by #11596
Labels: API Design, Output-Formatting (__repr__ of pandas objects, to_string)
Milestone: 0.17.1

Comments

@mrocklin (Contributor)

I would like to know how many bytes my dataframe takes up in memory. The standard way to do this is the memory_usage method:

df.memory_usage(index=True)

For object dtype columns this measures 8 bytes per element: the size of the reference, not the size of the full object. In some cases this significantly underestimates the size of the dataframe.

It might be nice to optionally map sys.getsizeof over object dtype columns to get a better estimate of the size. If this ends up being expensive, then it might be good to have it as an opt-in keyword argument:

df.memory_usage(index=True, measure_object=True)
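
A minimal sketch of what such an opt-in could compute (the object_aware_size helper is illustrative, not existing pandas API):

import sys
import pandas as pd

def object_aware_size(s):
    # Shallow count first: for an object dtype column this is 8 bytes per
    # element, i.e. just the pointers stored in the ndarray.
    total = int(s.memory_usage(index=False))
    # Then add the size of each referenced Python object.
    if s.dtype == object:
        total += sum(sys.getsizeof(v) for v in s.values)
    return total

df = pd.DataFrame({'A': ['foo', 'bar'] * 1000000})
object_aware_size(df['A'])  # much larger than the shallow df['A'].memory_usage()
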
@jreback (Contributor) commented Nov 13, 2015

In [13]: import sys

In [14]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [15]: df['B'] = df['A'].astype('category')

In [16]: df.memory_usage()
Out[16]: 
A    16000000
B     2000016
dtype: int64

In [17]: %timeit df.memory_usage()
10000 loops, best of 3: 66.1 µs per loop

In [18]: %timeit sum([ sys.getsizeof(v) for v in df['A'].values ])
1 loops, best of 3: 467 ms per loop

@mrocklin (Contributor, author)

Yup, it probably shouldn't be the default. I'd be quite happy to opt in.

On the other side of the comparison, 500 ms is very small compared to serialization and communication time if we mistakenly decide that it's a good idea to send this dataframe to another machine.

@jreback (Contributor) commented Nov 13, 2015

going to give you a cythonized comparison in a sec

@jreback (Contributor) commented Nov 13, 2015

In [8]: import sys

In [3]: pd.lib.memory_usage_of_array(np.array(['a'],dtype=object))
Out[3]: 46

In [4]: pd.lib.memory_usage_of_array(np.array(['ab'],dtype=object))
Out[4]: 47

In [5]: pd.lib.memory_usage_of_array(np.array(['abc'],dtype=object))
Out[5]: 48

In [6]: pd.lib.memory_usage_of_array(np.array(['abcd'],dtype=object))
Out[6]: 49

In [7]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [9]: sum([ sys.getsizeof(v) for v in df['A'].values ]) + df['A'].values.nbytes
Out[9]: 96000000

In [10]: pd.lib.memory_usage_of_array(df['A'].values)
Out[10]: 96000000

In [11]: %timeit pd.lib.memory_usage_of_array(df['A'].values)
10 loops, best of 3: 108 ms per loop

In [12]: %timeit sum([ sys.getsizeof(v) for v in df['A'].values ]) + df['A'].values.nbytes
1 loops, best of 3: 481 ms per loop

@jreback (Contributor) commented Nov 13, 2015

I am giving back the original nbytes plus the actual overhead (i.e. storing an element costs you the pointer in the ndarray plus the actual Python object storage). Here that's 2,000,000 × 8-byte pointers = 16,000,000 bytes, plus 2,000,000 × 40 bytes per three-character str = 80,000,000 bytes, giving the 96,000,000 above.

@mrocklin (Contributor, author)

Seems like a good idea. The speedup there is nice.

@jreback (Contributor) commented Nov 13, 2015

yep, ok, easy enough (the '+' in the info() readout, which marks the shallow figure as a lower bound, would then be dropped when you opt in)

@jreback (Contributor) commented Nov 13, 2015

In [7]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000000 entries, 0 to 1999999
Data columns (total 1 columns):
A    object
dtypes: object(1)
memory usage: 30.5+ MB

In [9]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000000 entries, 0 to 1999999
Data columns (total 1 columns):
A    object
dtypes: object(1)
memory usage: 106.8 MB
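
For reference, the same opt-in landed on memory_usage itself; the deep keyword name below is taken from the pandas documentation for the fix, not from this thread:

df.memory_usage(index=True)             # shallow: counts only the 8-byte pointers
df.memory_usage(index=True, deep=True)  # deep: also measures the referenced objects
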

@max-sixty (Contributor)

What do you think about overriding __sizeof__ with one of these?

@jreback (Contributor) commented Nov 13, 2015

ahh, so sys.getsizeof(df) would then give a nice answer. Sure, that would be ok.
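
A minimal sketch of such an override (an illustrative subclass, not the actual pandas implementation; assumes the deep keyword shown above):

import sys
import pandas as pd

class SizedFrame(pd.DataFrame):
    def __sizeof__(self):
        # Report deep memory usage so sys.getsizeof() gives a meaningful answer.
        return int(self.memory_usage(index=True, deep=True).sum())

df = SizedFrame({'A': ['foo', 'bar'] * 1000000})
sys.getsizeof(df)  # deep size, plus a small GC-header overhead added by getsizeof
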

jreback added the Output-Formatting and API Design labels Nov 13, 2015
jreback added this to the 0.17.1 milestone Nov 13, 2015
jreback added a commit to jreback/pandas that referenced this issue Nov 13, 2015
jreback added a commit that referenced this issue Nov 13, 2015: PERF/DOC: Option to .info() and .memory_usage() to provide for deep introspection of memory consumption #11595
@jickersville

Glad to finally see #8578 implemented. 👍 It appears that when a Continuum co-worker complains of a pandas wart, it gets fixed in 60 minutes instead of being repeatedly deflected with excuses over the course of 3 days until the user runs away screaming in exasperation.

Good work!

@shoyer @jorisvandenbossche

@jreback (Contributor) commented Mar 30, 2016

@jickersville that's not a very nice comment.

What issue has:

being repeatedly deflected with excuses over the course of 3 days until the user runs away screaming in exasperation.

????

@jreback (Contributor) commented Mar 30, 2016

Since the @jickersville account was created today, I suspect you are actually @kay1793, who was banned for egregious behavior. Prove me wrong here.

@shoyer (Member) commented Mar 30, 2016

sigh

