Optionally use sys.getsizeof in DataFrame.memory_usage #11595

Closed
mrocklin opened this Issue Nov 13, 2015 · 14 comments

mrocklin commented Nov 13, 2015

I would like to know how many bytes my dataframe takes up in memory. The standard way to do this is the memory_usage method:

df.memory_usage(index=True)

For object dtype columns this measures 8 bytes per element: the size of the reference, not the size of the full object. In some cases this significantly underestimates the size of the dataframe.

It might be nice to optionally map sys.getsizeof over object dtype columns to get a better estimate of the size. If this ends up being expensive, then it might be good to have it as an opt-in keyword argument:

df.memory_usage(index=True, measure_object=True)
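
A rough sketch of what that opt-in path could look like (the helper below is hypothetical, written here only to illustrate the idea; note that interned or shared objects get counted once per reference):

import sys

import pandas as pd

def deep_memory_usage(df, index=True):
    # Start from the shallow per-column byte counts, which already
    # include the 8-byte references themselves...
    usage = df.memory_usage(index=index)
    # ...then add the per-element sizes for object dtype columns.
    for col in df.columns:
        if df[col].dtype == object:
            usage[col] += sum(sys.getsizeof(v) for v in df[col])
    return usage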
jreback commented Nov 13, 2015

In [13]: import sys

In [14]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [15]: df['B'] = df['A'].astype('category')

In [16]: df.memory_usage()
Out[16]: 
A    16000000
B     2000016
dtype: int64

In [17]: %timeit df.memory_usage()
10000 loops, best of 3: 66.1 µs per loop

In [18]: %timeit sum([ sys.getsizeof(v) for v in df['A'].values ])
1 loops, best of 3: 467 ms per loop
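
As a back-of-the-envelope check on those figures (assuming 64-bit object pointers and int8 codes for a two-category column):

2000000 * 8          # A: one 8-byte object pointer per row -> 16000000
2000000 * 1 + 2 * 8  # B: one int8 code per row + two category pointers -> 2000016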
mrocklin commented Nov 13, 2015

Yup, probably shouldn't be the default. I'd be quite happy to opt in.

On the other side of the comparison, 500 ms is very small compared to serialization and communication time if we mistakenly decide that it's a good idea to communicate this dataframe to another machine.

jreback commented Nov 13, 2015

going to give you a cythonized comparison in a sec

jreback commented Nov 13, 2015

In [8]: import sys

In [3]: pd.lib.memory_usage_of_array(np.array(['a'],dtype=object))
Out[3]: 46

In [4]: pd.lib.memory_usage_of_array(np.array(['ab'],dtype=object))
Out[4]: 47

In [5]: pd.lib.memory_usage_of_array(np.array(['abc'],dtype=object))
Out[5]: 48

In [6]: pd.lib.memory_usage_of_array(np.array(['abcd'],dtype=object))
Out[6]: 49

In [7]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [9]: sum([ sys.getsizeof(v) for v in df['A'].values ]) + df['A'].values.nbytes
Out[9]: 96000000

In [10]: pd.lib.memory_usage_of_array(df['A'].values)
Out[10]: 96000000

In [11]: %timeit pd.lib.memory_usage_of_array(df['A'].values)
10 loops, best of 3: 108 ms per loop

In [12]: %timeit sum([ sys.getsizeof(v) for v in df['A'].values ]) + df['A'].values.nbytes
1 loops, best of 3: 481 ms per loop
jreback commented Nov 13, 2015

I am giving back the original nbytes plus the actual overhead (i.e. storing each element costs you the pointer in the ndarray plus the object's own storage).
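
For the column above that arithmetic works out exactly (a back-of-the-envelope check, assuming 64-bit Python 2, where sys.getsizeof of a 3-character str is 40 bytes):

pointers = 2000000 * 8   # the ndarray's nbytes: one pointer per element
payload = 2000000 * 40   # sys.getsizeof('foo') == sys.getsizeof('bar') == 40
pointers + payload       # 96000000, matching Out[9] and Out[10]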

mrocklin commented Nov 13, 2015

Seems like a good idea. The speedup there is nice.

jreback commented Nov 13, 2015

yep, ok, easy enough (would then turn off the '+' qualifier in the display) if you opt in

jreback commented Nov 13, 2015

In [7]: df = DataFrame({'A' : ['foo','bar']*1000000})

In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000000 entries, 0 to 1999999
Data columns (total 1 columns):
A    object
dtypes: object(1)
memory usage: 30.5+ MB

In [9]: df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000000 entries, 0 to 1999999
Data columns (total 1 columns):
A    object
dtypes: object(1)
memory usage: 106.8 MB
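
Both figures line up with the numbers above, and the '+' in '30.5+ MB' flags the shallow figure as a lower bound. A quick check (reporting MiB, as info() does):

shallow = 16000000 + 16000000  # column A pointers + the Int64Index
deep = 96000000 + 16000000     # deep column A (Out[10] above) + the Int64Index
shallow / 2.0 ** 20            # ~30.5, matching '30.5+ MB'
deep / 2.0 ** 20               # ~106.8, matching '106.8 MB'

Per the PR title referenced below, the same deep accounting is also exposed directly as df.memory_usage(deep=True).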
max-sixty commented Nov 13, 2015

What do you think about overriding __sizeof__ with one of these?
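
A minimal sketch of what that might look like (a hypothetical subclass, not the actual pandas implementation, and assuming the deep= keyword that the PR below adds): delegate __sizeof__ to the deep per-column accounting so sys.getsizeof reports something meaningful.

import sys

import pandas as pd

class DeepSizedFrame(pd.DataFrame):
    # Hypothetical: route Python's size protocol through the deep
    # per-column accounting, so sys.getsizeof(frame) is meaningful.
    def __sizeof__(self):
        return int(self.memory_usage(index=True, deep=True).sum())

frame = DeepSizedFrame({'A': ['foo', 'bar'] * 1000000})
sys.getsizeof(frame)  # the deep estimate, plus a small GC header overhead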

jreback commented Nov 13, 2015

ahh, so sys.getsizeof(df) gives a nice answer, sure. that would be ok.

@jreback jreback added this to the 0.17.1 milestone Nov 13, 2015

jreback added a commit to jreback/pandas that referenced this issue Nov 13, 2015

jreback added a commit that referenced this issue Nov 13, 2015

Merge pull request #11596 from jreback/memory
PERF/DOC: Option to .info() and .memory_usage() to provide for deep introspection of memory consumption #11595
jickersville commented Mar 30, 2016

Glad to finally see #8578 implemented. 👍 It appears that when a Continuum co-worker complains of a pandas wart, it gets fixed in 60 minutes instead of being repeatedly deflected with excuses over the course of 3 days until the user runs away screaming in exasperation.

Good work!

@shoyer @jorisvandenbossche

jreback commented Mar 30, 2016

@jickersville that's not a very nice comment.

What issue has:

being repeatedly deflected with excuses over the course of 3 days until the user runs away screaming in exasperation.

????

jreback commented Mar 30, 2016

since the @jickersville account was created today, I suspect you are actually @kay1793, who was banned for egregious behavior. prove me wrong here.

shoyer commented Mar 30, 2016

sigh

