DataFrame.describe(percentiles=[]) still returns 50% percentile. #11866

Open · dragoljub opened this issue Dec 18, 2015 · 10 comments

@dragoljub (Author) commented Dec 18, 2015

The DataFrame.describe() method docs seem to indicate that you can pass percentiles=None to not compute any percentiles; however, by default it still computes 25%, 50%, and 75%. The best I can do is pass an empty list, which still computes the 50% percentile. I would expect passing an empty list to return no percentile computations.

Should we allow passing an empty list to not compute any percentiles?

pandas 0.17.1

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame(np.random.randn(10,6))

In [4]: df.describe(percentiles=None)
Out[4]:
               0          1          2          3          4          5  
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean   -0.116736  -0.160728   0.066763  -0.068867  -0.242050   0.390091
std     0.771704   0.837520   0.875747   0.955985   1.093919   0.923464
min    -1.347786  -1.140541  -1.297533  -1.347824  -2.085290  -0.825807
25%    -0.580527  -0.613640  -0.558291  -0.538433  -0.836046  -0.275567
50%    -0.261526  -0.395307   0.007595  -0.248025   0.000515   0.314278
75%     0.329780   0.154053   0.708768   0.407732   0.366278   1.192338
max     1.285276   1.649528   1.485076   1.697162   1.551388   1.762939

In [15]: df.describe(percentiles=[])
Out[15]:
               0          1          2          3          4          5  
count  10.000000  10.000000  10.000000  10.000000  10.000000  10.000000
mean   -0.116736  -0.160728   0.066763  -0.068867  -0.242050   0.390091
std     0.771704   0.837520   0.875747   0.955985   1.093919   0.923464
min    -1.347786  -1.140541  -1.297533  -1.347824  -2.085290  -0.825807
50%    -0.261526  -0.395307   0.007595  -0.248025   0.000515   0.314278
max     1.285276   1.649528   1.485076   1.697162   1.551388   1.762939

@rockg (Contributor) commented Dec 18, 2015

I think the goal here is to always return the median, which is a useful statistic, and the code comments echo that. We can clear up the documentation if that would help. What are you trying to achieve?

@dragoljub (Author) commented Dec 18, 2015

I was just trying to avoid computing any percentiles/median because that often involves sorting, which could take some time depending on how many columns of data you are looking at. I suppose the 50%/median makes sense as a default in describe. Still, I would expect passing an empty list to skip even the 50%/median.

@jreback (Contributor) commented Dec 18, 2015

The median does not involve sorting, as it's implemented using a skip list.

In fact it's just O(n) (and it's in C).
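As an aside, a selection-based median runs in expected O(n) without a full sort; this minimal NumPy sketch via `np.partition` is illustrative only, not pandas' actual skip-list implementation:

```python
import numpy as np

def median_via_partition(a):
    """Expected O(n) median using selection (np.partition)
    instead of a full O(n log n) sort."""
    a = np.asarray(a, dtype=float)
    n = a.size
    mid = n // 2
    if n % 2:  # odd length: single middle order statistic
        return np.partition(a, mid)[mid]
    # even length: average the two middle order statistics
    part = np.partition(a, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

vals = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
print(median_via_partition(vals))  # 3.0, same as np.median(vals)
```

`np.partition` only guarantees the requested order statistics are in place, which is why no full sort is needed.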

@shoyer (Member) commented Dec 19, 2015

Yep, there's not much to be gained by dropping percentiles -- every summary operation is O(n).

On the other hand, if you have actual big data, then you probably want to use approximate (sketch) algorithms for quantiles so you can do stream processing. But that's not really a problem for pandas...
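As a minimal illustration of the sampling idea (production systems would use a dedicated sketch such as t-digest; the function names here are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream
    of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            # Replace an existing element with probability k/(i+1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

def approx_quantile(stream, q, k=1000, seed=0):
    """Estimate the q-quantile of a stream from a size-k sample."""
    s = sorted(reservoir_sample(stream, k, seed))
    idx = min(int(q * len(s)), len(s) - 1)
    return s[idx]

est = approx_quantile(range(100000), 0.5, k=2000)
print(est)  # roughly 50000, the true median
```

The estimate improves as the sample size k grows, while memory stays O(k) regardless of stream length.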

@dragoljub (Author) commented Dec 19, 2015

Medians should be fast, but take a look at the performance difference I'm getting. Even if I hack up a quick 'describe' function with concat and transpose, it's quite a bit faster than df.describe(). When I remove the median, it's an additional 2x faster compared with computing the median. 😕

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100000,1000), columns=['C{}'.format(i) for i in range(1000)])

%time a = df.describe(percentiles=[])
    Wall time: 17.8 s

%time b = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.median(), df.max()], axis=1).T
    Wall time: 10.8 s

%time c = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.max()], axis=1).T
    Wall time: 4.94 s

np.array_equal(a,b)
    True

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 16.0
Cython: 0.23
numpy: 1.9.2
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: None
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.1
numexpr: 2.4.3
matplotlib: 1.5.0
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
Jinja2: None

@jreback (Contributor) commented Dec 19, 2015

@dragoljub you realize this has nothing to do with the median per se, and much more to do with a column-by-column application of functions. .describe is essentially a fancy .apply. Note that it could be implemented to do this on blocks and it would be much faster.
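A block-wise variant could look roughly like this sketch, which runs whole-array NumPy reductions instead of a per-column apply (`describe_blockwise` is a hypothetical name, and `DataFrame.to_numpy` postdates pandas 0.17):

```python
import numpy as np
import pandas as pd

def describe_blockwise(df):
    """Compute describe()-style stats with whole-array NumPy
    reductions rather than a column-by-column apply."""
    a = df.to_numpy(dtype=float)
    stats = np.vstack([
        np.sum(~np.isnan(a), axis=0),   # count
        np.nanmean(a, axis=0),          # mean
        np.nanstd(a, axis=0, ddof=1),   # std (sample std, like pandas)
        np.nanmin(a, axis=0),           # min
        np.nanmedian(a, axis=0),        # 50%
        np.nanmax(a, axis=0),           # max
    ])
    return pd.DataFrame(stats,
                        index=['count', 'mean', 'std', 'min', '50%', 'max'],
                        columns=df.columns)

df = pd.DataFrame(np.random.randn(10000, 50))
print(np.allclose(describe_blockwise(df).values,
                  df.describe(percentiles=[]).values))  # True
```

This only works cleanly for all-numeric frames; describe's real implementation also has to handle mixed dtypes, which is part of why it goes column by column.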

@shoyer (Member) commented Dec 19, 2015

That is a fair point... medians may still be O(n) in time and space, but they are indeed slower than calculating moments.

In any case I agree that this is a bug. A fix would be appreciated!

@jreback (Contributor) commented Dec 19, 2015

this is essentially the same issue as #11623 (the perf part)

@jreback jreback added this to the Next Major Release milestone Dec 19, 2015

@dragoljub (Author) commented Dec 19, 2015

Yes block level computation would be great! 👍

The other point I'm making is:
Should we have an escape hatch in df.describe() for users who don't want to compute medians for thousands of columns? Even with block-level computation, the median takes several times longer than all the other statistics combined. 🐢
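For what it's worth, two workarounds are possible with the public API alone (a sketch; the row labels below are just describe's own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

# Option 1: let describe compute the median, then drop the row.
# Still pays the median cost, but yields the desired shape.
no_median = df.describe(percentiles=[]).drop('50%')

# Option 2: skip describe entirely and never compute the median.
fast = pd.concat([df.count(), df.mean(), df.std(), df.min(), df.max()],
                 axis=1).T
fast.index = ['count', 'mean', 'std', 'min', 'max']

print(list(no_median.index))  # ['count', 'mean', 'std', 'min', 'max']
print(np.allclose(no_median.values, fast.values))  # True
```

Option 2 is the true escape hatch: the median is never touched, at the cost of hand-rolling the summary table.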

@RhysU commented Feb 25, 2019

If the empty list always computes the 50th percentile, how about a documentation update indicating this is expected behavior?
