
DOC: Timings/space of datatypes in the docs #3871

Closed · hayd opened this issue Jun 12, 2013 · 14 comments
Labels: Docs, Performance

Comments

@hayd (Contributor) commented Jun 12, 2013

Would it be useful to have a section in the docs discussing:

  • how much space pandas objects take up (roughly how rows × columns translates to in-memory size; if I have an x GB csv, how big will it be in pandas/HDF5/etc.?). A sketch follows below.
  • how quick some standard operations are likely to be (e.g. read_csv/merge/join/etc. vs. data size).
  • how these compare to other platforms (?)

Probably distinct from comparing functionality (although that may also be interesting), e.g. as NumPy does for features against MATLAB here: http://wiki.scipy.org/NumPy_for_Matlab_Users. See also #3980.
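As a hedged illustration of what the first two bullets could look like, here is a minimal sketch using the DataFrame.memory_usage API from later pandas releases (the frame shape and file name are made up for the example):

import os
import numpy as np
import pandas as pd

# Illustrative frame: 1M rows of mixed dtypes.
df = pd.DataFrame({
    "a": np.random.randn(1000000),             # float64
    "b": np.random.randint(0, 100, 1000000),   # int64
    "c": ["label"] * 1000000,                  # object (strings)
})

# Per-column bytes; deep=True also counts the Python string payloads.
print(df.memory_usage(deep=True))

# In-memory total vs. size on disk as csv.
df.to_csv("example.csv")
print("in memory:", df.memory_usage(deep=True).sum(), "bytes")
print("as csv:   ", os.path.getsize("example.csv"), "bytes")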

@cpcloud (Member) commented Jun 13, 2013

my 2c:

  1. I think it's hard to say in general how big pandas objects are because of dtype heterogeneity. For homogeneous frames they aren't that much different from df.values.nbytes + df.columns.values.nbytes + df.index.values.nbytes (ignoring the size of other Python objects needed for repr-ing and so forth); a sketch of this estimate follows after this list. Comparing a GB-order-of-magnitude csv to how big that will be in pandas space doesn't seem that useful, since most sane folks will not be storing files that big as text. If they are, they should immediately convert to HDF5; even just an npz file would be an improvement.
  2. I think this is interesting.
  3. Certain platforms, e.g. MATLAB, completely fail at everything that pandas succeeds at. In MATLAB there's a bastard version of DataFrame called dataset that really is just a matrix with some labels and nothing more, and that I wouldn't recommend to my worst enemy. It's horrible, and a comparison would not be worth the time it would take to replicate even a tiny subset of what pandas does. The only comparable platform I can think of is R. I'm sure other people know of others...
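To make (1) concrete, a minimal sketch of that estimate on a homogeneous frame (the shape is illustrative; the axis arrays are tiny next to the data block):

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randn(100000, 20))

# The estimate described above: the underlying float block plus both axis arrays.
estimate = (df.values.nbytes
            + df.index.values.nbytes
            + df.columns.values.nbytes)
print(estimate)  # 100000 * 20 * 8 bytes of float64 data, plus two small axes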

@hayd (Contributor, Author) commented Jun 13, 2013

(Not sure why I took out the mention of R; it was there, and it was the main one I had in mind :).)

  1. It's true that it varies, but it might be useful to give some examples (along with a lack-of-generality warning) so people can get a vague idea. For some example csv (with n rows and m cols): how long it takes to read in, how much space it takes up in memory, and how much space it would take up as a pickle, in HDF5, in postgres, etc. A sketch of such a benchmark is below.
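A rough sketch of the benchmark being proposed (file names are illustrative, to_hdf requires PyTables, the postgres comparison via to_sql is omitted, and timings will vary by machine):

import os
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 5))
df.to_csv("bench.csv")

# How long read_csv takes on this file.
t0 = time.perf_counter()
pd.read_csv("bench.csv", index_col=0)
print("read_csv: %.2fs" % (time.perf_counter() - t0))

# In-memory footprint vs. on-disk size in a few storage formats.
print("in memory:", df.memory_usage().sum(), "bytes")
df.to_pickle("bench.pkl")
df.to_hdf("bench.h5", key="df")
for path in ("bench.csv", "bench.pkl", "bench.h5"):
    print(path, os.path.getsize(path), "bytes")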

@hayd (Contributor, Author) commented Jun 14, 2013

@hayd (Contributor, Author) commented Jun 24, 2013

@jreback (Contributor) commented Jun 26, 2013

See https://groups.google.com/forum/m/#!topic/pydata/G6Z-SN9SJnY for a conversation about this.

@jreback (Contributor) commented Jul 10, 2013

Can prob add this to the Enhancing Performance section (or maybe it should be renamed to Performance?)

@hayd (Contributor, Author) commented Jul 10, 2013

@jreback I kind of think these should be distinct, but I'm not sure what a good name would be.

@jreback (Contributor) commented Jul 10, 2013

OK, sure. Maybe a new top-level section (or maybe part of the FAQ or something).

@hayd (Contributor, Author) commented Aug 1, 2013

Related: #696. See also the perf of read_csv from Wes's blog: http://wesmckinney.com/blog/?p=543

@hayd (Contributor, Author) commented Aug 6, 2013

@jreback (Contributor) commented Aug 6, 2013

Reproducing my answer here (from the link above):

You have to do this in reverse.

In [3]: import numpy as np; from pandas import DataFrame; from numpy.random import randn

In [4]: df = DataFrame(randn(1000000,20))

In [5]: df.to_csv('test.csv')

In [6]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

In [7]: df.values.nbytes
Out[7]: 160000000

Technically, memory is about this (which includes the indexes):

In [8]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[8]: 168000160

So ~160MB in memory for a ~400MB file: 1M rows of 20 float columns.

In [9]: df.to_hdf('test.h5','df')

In [10]: !ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file.

In [12]: df.to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data is not that compressible, though, as it's random.

With strings (the same string repeated, so maybe a little bogus), the csv is about half the size of the float case!

In [26]: df = DataFrame(np.array(['ABCDEFGH']*20*1000000, dtype=object).reshape(1000000, 20))

In [29]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[29]: 168000160

In [30]: df.to_csv('test.csv')

In [31]: !ls -ltr test.csv
-rw-rw-r-- 1 users 186888941 Aug  6 17:29 test.csv

In [32]: df.to_hdf('test.h5','df')

In [33]: !ls -ltr test.h5
-rw-rw-r-- 1 users 49166896 Aug  6 17:29 test.h5
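A caveat worth noting on the string numbers above: for object columns, nbytes counts only the 8-byte pointers, not the strings they reference. Here every cell shares one string object, so 168000160 happens to be close to the true footprint, but with distinct strings it would understate memory. Later pandas versions added memory_usage(deep=True), which inspects the payloads; a sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(['ABCDEFGH'] * 20 * 10000, dtype=object).reshape(10000, 20))

# Shallow: 8 bytes per cell, whatever the string length.
print(df.memory_usage().sum())

# Deep: adds sys.getsizeof of each element. Every cell here points at the
# same string object, so deep also overcounts shared payloads.
print(df.memory_usage(deep=True).sum())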

jreback modified the milestones: 0.15.0, 0.14.0 (Feb 20, 2014)
@jreback (Contributor) commented Apr 8, 2014

Just put this in for a perf comparison of IO methods: 0d79ff8

So, partial progress on this.

jreback modified the milestones: 0.16.0, Next Major Release (Mar 6, 2015)
mroeschke removed this from the Contributions Welcome milestone (Oct 13, 2022)
@MarcoGorelli (Member) commented

Closing due to lack of activity; it's not really clear what's needed anymore at this point.
