
ENH: more flexible describe() + tests #8164

Closed
wants to merge 1 commit into from

Conversation

4 participants
@bthyreau (Contributor) commented Sep 3, 2014

this patch adds a return_type keyword argument to describe() to make it more
flexible to use on mixed-type dataframes. Users can now select among returning the
numeric summary, the categorical summary, or both, as well as 'auto' (the previous
behaviour, and the default) and 'same', which keeps the columns identical (useful,
e.g., with groupby())

@jreback (Contributor) commented Sep 3, 2014

pls show examples of the use case for this

@bthyreau (Contributor, Author) commented Sep 3, 2014

Rationale.

When using a dataframe of mixed types, i.e. one containing numeric values, strings, categoricals, etc., the current behaviour of describe() is a bit rough, as it summarizes only the numerical columns or, if none exist, only the categorical columns. With this change, describe() becomes more flexible in its return form, which has considerably smoothed my interactive data-analysis sessions.

From the doc

  return_type : string, default 'auto'
       The result format. Most relevant for mixed-type dataframes.
       One of 'numeric_only', 'categorical_only', 'both', 'same', 'auto'.
       'auto' returns the 'numeric_only' description, except on fully
       categorical frames, where it returns 'categorical_only'.
       'both' returns the pair; 'same' returns a frame with the same
       columns as the original dataframe.

Example

Although real-life scenarios are more convincing, here is a small example

In [1]: import numpy as np
In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
   ...:                         'colB': ['a', 'b', 'c', 'd', 'e'] * 6,
   ...:                         'colC': np.arange(30), 'colD' : np.ones(30)})

In [4]: df.head()
Out[4]: 
  colA colB  colC  colD
0  foo    a     0     1
1  foo    b     1     1
2  bar    c     2     1
3  foo    d     3     1
4  foo    e     4     1

# the old behaviour picks columns based on the dtypes in the dataframe. Not so nice.
In [6]: df.describe()
Out[6]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

# using the new option, we can explicitly ask to describe each type
In [8]: df.describe(return_type="categorical_only")
Out[8]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [9]: df.describe(return_type="numeric_only")
Out[9]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

# using option "same" returns a df with the same columns
In [11]: df.describe(return_type="same") 
Out[11]: 
       colA colB      colC colD
count    30   30        30   30
unique    2    5       NaN  NaN
top     foo    d       NaN  NaN
freq     20    6       NaN  NaN
mean    NaN  NaN      14.5    1
std     NaN  NaN  8.803408    0
min     NaN  NaN         0    1
25%     NaN  NaN      7.25    1
50%     NaN  NaN      14.5    1
75%     NaN  NaN     21.75    1
max     NaN  NaN        29    1

# one of my favorite patterns, using groupby:
In [13]: out = df.groupby("colA").describe(return_type="same")

In [14]: out.unstack(0)
       colB           colC           colD     
colA    bar  foo       bar       foo  bar  foo
count    10   20        10        20   10   20
unique    5    5       NaN       NaN  NaN  NaN
top       d    d       NaN       NaN  NaN  NaN
freq      2    4       NaN       NaN  NaN  NaN
mean    NaN  NaN      15.5        14    1    1
std     NaN  NaN  9.082951  8.855566    0    0
min     NaN  NaN         2         0    1    1
25%     NaN  NaN      8.75      6.75    1    1
50%     NaN  NaN      15.5        14    1    1
75%     NaN  NaN     22.25     21.25    1    1
max     NaN  NaN        29        28    1    1
@jreback (Contributor) commented Sep 3, 2014

Is there a reason you think that the above approach is better than:

In [8]: df.select_dtypes(['object']).describe()
Out[8]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [9]: df.select_dtypes(['number']).describe()
Out[9]: 
            colC  colD
count  30.000000    30
mean   14.500000     1
std     8.803408     0
min     0.000000     1
25%     7.250000     1
50%    14.500000     1
75%    21.750000     1
max    29.000000     1

(possibly adding include/exclude to .describe then doing a select_dtypes might be ok though)

@bthyreau (Contributor, Author) commented Sep 4, 2014

  • The current behaviour of describe() is unintuitive (unless the dataframe has homogeneous types), as it returns a subset of columns, and that subset depends on the other columns present.
  • describe(), as an easy, short, concise way to get an overview of the df values, is especially appreciated during interactive sessions. A manual approach, or "select_dtypes" pre-filtering, can of course eventually produce the same output, but it defeats some of the practicality of describe().
  • Getting the result of describe() with the same columns as the original df is quite handy: e.g. you can reuse the same column-subset list. For example, compare
In [60]: df.dtypes
Out[60]: 
colA     object
colB     object
colC      int32
colD    float64
dtype: object


In [56]: model_col = ["colA","colB"]

In [57]: df.loc[:,model_col].describe().loc[:,model_col]
Out[57]: 
       colA colB
count    30   30
unique    2    5
top     foo    d
freq     20    6

In [58]: model_col = ["colA","colB","colC"]

In [59]: df.loc[:,model_col].describe().loc[:,model_col]
Out[59]: 
       colA  colB       colC
count   NaN   NaN  30.000000
mean    NaN   NaN  14.500000
std     NaN   NaN   8.803408
min     NaN   NaN   0.000000
25%     NaN   NaN   7.250000
50%     NaN   NaN  14.500000
75%     NaN   NaN  21.750000
max     NaN   NaN  29.000000

Here we have lost the count, unique, top, etc. of colA and colB as soon as we introduced colC into the model

However, nothing is lost anymore when using return_type = "same"

In [61]: df.loc[:,model_col].describe(return_type="same").loc[:,model_col]
Out[61]: 
       colA colB      colC
count    30   30        30
unique    2    5       NaN
top     foo    d       NaN
freq     20    6       NaN
mean    NaN  NaN      14.5
std     NaN  NaN  8.803408
min     NaN  NaN         0
25%     NaN  NaN      7.25
50%     NaN  NaN      14.5
75%     NaN  NaN     21.75
max     NaN  NaN        29
  • It also works out-of-the-box with groupby() result objects. Very useful when interacting with data in an exploratory way: I can easily see/stack the summary results of my various grouping pivots. An alternative would need several lines of code, possibly a small helper function (or an ugly lambda) every time.

In the example above, even with select_dtypes(), I'm not sure how to properly include a summary for colB (categorical) in the default df.groupby("colA").describe() output. With my proposal, it's just a matter of df.groupby("colA").describe(return_type="same").

Of course, it's more convincing with real-world large dataframes of mixed types (as used e.g. in psychology), where it's easy to mentally lose track of all the columns and their types.

  • most of the internal type-selecting logic was already implemented, so it was only a matter of exposing it
@jreback (Contributor) commented Sep 4, 2014

ok, your idea of 'same' is ok, but the API is not consistent with the pandas style.

I would be ok with adding include/exclude keywords to .describe to filter (with defaults of include='number', exclude=None). You could simply make it accept include='same' (and intercept that) to implement the functionality (and use .select_dtypes internally).

it's a well-constructed and general API by @cpcloud
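As an editorial aside, the intercept-and-validate step being suggested can be sketched in plain Python. This is a hypothetical helper with illustrative names, not pandas code, and it ignores the historical fully-categorical fallback for brevity:

```python
def resolve_dtype_filter(include, exclude):
    """Validate a describe()-style include/exclude pair and intercept
    the special 'all' value before anything reaches select_dtypes.

    Returns None to mean "no filtering at all", otherwise the
    (include, exclude) pair to forward as-is.
    """
    if include == 'all':
        if exclude is not None:
            raise ValueError("exclude must be None when include is 'all'")
        return None  # describe every column
    if include is None and exclude is None:
        # historical default: numeric columns only
        return (['number'], None)
    return (include, exclude)  # forward to select_dtypes unchanged
```

With this shape, `None` signals "skip the dtype filter entirely", and every other spec is handed to the filtering machinery untouched.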

@bthyreau (Contributor, Author) commented Sep 5, 2014

ok, I implemented your suggested API, and it's indeed more flexible, while retaining the usability. Great!

It is now possible to specify the output form using include=/exclude= lists.
I overrode 'include' to accept an additional 'all' (alias '*') keyword, which enforces no filtering, while the None/None (default) pair enforces the previous, type-dependent behaviour.

Some snippets below:

>>> from pandas import Series
>>> from pandas import DataFrame
>>> import pandas.util.testing as tm
>>> import numpy as np
>>> 
>>> df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
...         'catB': ['a', 'b', 'c', 'd'] * 6,
...         'numC': np.arange(24),
...         'numD': np.arange(24.) + .5,
...         'ts': tm.makeTimeSeries()[:24].index})
>>> 
>>> 
>>> df.describe(include=["number","object"])
       catA catB       numC       numD
count    24   24  24.000000  24.000000
unique    2    4        NaN        NaN
top     foo    d        NaN        NaN
freq     16    6        NaN        NaN
mean    NaN  NaN  11.500000  12.000000
std     NaN  NaN   7.071068   7.071068
min     NaN  NaN   0.000000   0.500000
25%     NaN  NaN   5.750000   6.250000
50%     NaN  NaN  11.500000  12.000000
75%     NaN  NaN  17.250000  17.750000
max     NaN  NaN  23.000000  23.500000
>>> df.loc[:,:].describe() # as before
            numC       numD
count  24.000000  24.000000
mean   11.500000  12.000000
std     7.071068   7.071068
min     0.000000   0.500000
25%     5.750000   6.250000
50%    11.500000  12.000000
75%    17.250000  17.750000
max    23.000000  23.500000
>>> 
>>> df.loc[:,['catA','catB','ts']].describe() # contains NaN, as before
       catA catB                   ts
count    24   24                   24
unique    2    4                   24
first   NaN  NaN  2000-01-03 00:00:00
last    NaN  NaN  2000-02-03 00:00:00
top     foo    d  2000-01-31 00:00:00
freq     16    6                    1
>>> 
>>> df.describe(include=["object"])
       catA catB
count    24   24
unique    2    4
top     foo    d
freq     16    6
>>> df.describe(include='*')
       catA catB       numC       numD                   ts
count    24   24  24.000000  24.000000                   24
unique    2    4        NaN        NaN                   24
top     foo    d        NaN        NaN  2000-01-31 00:00:00
freq     16    6        NaN        NaN                    1
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00
mean    NaN  NaN  11.500000  12.000000                  NaN
std     NaN  NaN   7.071068   7.071068                  NaN
min     NaN  NaN   0.000000   0.500000                  NaN
25%     NaN  NaN   5.750000   6.250000                  NaN
50%     NaN  NaN  11.500000  12.000000                  NaN
75%     NaN  NaN  17.250000  17.750000                  NaN
max     NaN  NaN  23.000000  23.500000                  NaN
>>> 
>>> df.loc[:,['catA','catB']].describe(include='*')
       catA catB
count    24   24
unique    2    4
top     foo    d
freq     16    6
>>> df.describe(include='*', exclude='XXX')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 3681, in describe
    raise ValueError("exclude must be None when include is '%s'" % include)
ValueError: exclude must be None when include is '*'
>>> 
>>> df.groupby("catA").describe(include='*') # my favorite
            catB       numC       numD                   ts
catA                                                       
bar  count     8   8.000000   8.000000                    8
     unique    4        NaN        NaN                    8
     top       d        NaN        NaN  2000-01-31 00:00:00
     freq      2        NaN        NaN                    1
     first   NaN        NaN        NaN  2000-01-05 00:00:00
     last    NaN        NaN        NaN  2000-02-03 00:00:00
     mean    NaN  12.500000  13.000000                  NaN
     std     NaN   7.348469   7.348469                  NaN
     min     NaN   2.000000   2.500000                  NaN
     25%     NaN   7.250000   7.750000                  NaN
     50%     NaN  12.500000  13.000000                  NaN
     75%     NaN  17.750000  18.250000                  NaN
     max     NaN  23.000000  23.500000                  NaN
foo  count    16  16.000000  16.000000                   16
     unique    4        NaN        NaN                   16
     top       d        NaN        NaN  2000-01-25 00:00:00
     freq      4        NaN        NaN                    1
     first   NaN        NaN        NaN  2000-01-03 00:00:00
     last    NaN        NaN        NaN  2000-02-02 00:00:00
     mean    NaN  11.000000  11.500000                  NaN
     std     NaN   7.118052   7.118052                  NaN
     min     NaN   0.000000   0.500000                  NaN
     25%     NaN   5.500000   6.000000                  NaN
     50%     NaN  11.000000  11.500000                  NaN
     75%     NaN  16.500000  17.000000                  NaN
     max     NaN  22.000000  22.500000                  NaN
>>> df.groupby("catA").describe(include=["object", "datetime", "number"], exclude=["float"])
            catB       numC                   ts
catA                                            
bar  count     8   8.000000                    8
     unique    4        NaN                    8
     top       d        NaN  2000-01-31 00:00:00
     freq      2        NaN                    1
     first   NaN        NaN  2000-01-05 00:00:00
     last    NaN        NaN  2000-02-03 00:00:00
     mean    NaN  12.500000                  NaN
     std     NaN   7.348469                  NaN
     min     NaN   2.000000                  NaN
     25%     NaN   7.250000                  NaN
     50%     NaN  12.500000                  NaN
     75%     NaN  17.750000                  NaN
     max     NaN  23.000000                  NaN
foo  count    16  16.000000                   16
     unique    4        NaN                   16
     top       d        NaN  2000-01-25 00:00:00
     freq      4        NaN                    1
     first   NaN        NaN  2000-01-03 00:00:00
     last    NaN        NaN  2000-02-02 00:00:00
     mean    NaN  11.000000                  NaN
     std     NaN   7.118052                  NaN
     min     NaN   0.000000                  NaN
     25%     NaN   5.500000                  NaN
     50%     NaN  11.000000                  NaN
     75%     NaN  16.500000                  NaN
     max     NaN  22.000000                  NaN

Some minor design-decision points

  • Meaningless input raises exceptions (mostly through select_dtypes), except through groupby(), which catches them and falls back to returning 'all'.
  • the index of the returned result (count/mean/50%/unique/first...) still varies depending on the dataframe content (which kind of contradicts my initial complaint about columns; I never saw that as a problem when it concerned rows). Its order is also not stable.
  • I thought it might be convenient to move the '*' dtype-filtering trick directly into select_dtypes(). Would that be useful?
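The fallback described in the first point can be sketched with a hypothetical wrapper (illustrative only; the real behaviour lives inside pandas' groupby code paths):

```python
def describe_with_fallback(describe, include, exclude):
    """Apply a dtype filter to a describe-style callable; if the filter
    is meaningless for this (sub-)frame, fall back to describing all
    columns instead of erroring out mid-groupby.

    `describe` is any callable accepting include=/exclude= keywords.
    """
    try:
        return describe(include=include, exclude=exclude)
    except (ValueError, TypeError):
        # e.g. the dtype spec was rejected, or no column matched
        return describe(include='all', exclude=None)
```

A fake `describe` callable is enough to exercise both the normal path and the fallback path.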
@jreback reviewed pandas/core/generic.py (Outdated):
for x in fself.columns]
# merge individual outputs, preserving index order as possible
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)

@jreback (Contributor) commented Sep 5, 2014

hmm, concat does this by default (preserves order and concats the other axes),
so there's no need for all this section (you might need to tweak some options to concat)

@bthyreau (Author, Contributor) commented Sep 6, 2014

Unless I missed something, I believe by default only the column order is preserved; the index order is lexsorted, which makes the output a bit unfriendly. Without it:

pd.concat(ldesc, keys=fself.columns, axis=1)
Out[38]: 
       catA catB       numC       numD                   ts
25%     NaN  NaN   5.750000   6.250000                  NaN
50%     NaN  NaN  11.500000  12.000000                  NaN
75%     NaN  NaN  17.250000  17.750000                  NaN
count    24   24  24.000000  24.000000                   24
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00
freq     16    6        NaN        NaN                    1
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00
max     NaN  NaN  23.000000  23.500000                  NaN
mean    NaN  NaN  11.500000  12.000000                  NaN
min     NaN  NaN   0.000000   0.500000                  NaN
std     NaN  NaN   7.071068   7.071068                  NaN
top     foo    d        NaN        NaN  2000-01-31 00:00:00
unique    2    4        NaN        NaN                   24

It interleaves the different types' outputs. Though the output index order of describe() is not supposed to be guaranteed, I find it greatly convenient for interactive sessions.

@jreback (Contributor) commented Sep 6, 2014

you don't need to use keys; that's the problem.
Just construct the sub-frames, and concat glues them together in the same order

@bthyreau (Author, Contributor) commented Sep 6, 2014

Sorry, I'm confused.

Here, the ldesc list is already in column order, ready to be concat'ed; I use keys= only as a way to set back the column names that the Series.describe() calls lost.
The index (row) order has to be ordered "by blocks" according to a logic which cannot trivially be inferred from the list (i.e. categorical rows if any, then numeric rows if any); even if I could somehow avoid the lexsort and use the inputs' orders (which I failed to do), it would order the blocks differently depending on the first column's dtype, which is less friendly for the user. So I believe it's necessary to pre-create the index order ("names" here).

I admit I could use join_axes=Index(names) instead of .loc[names], though.
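The row-ordering logic under discussion boils down to a small pure-Python helper (a sketch of the idea, not the PR's exact code): sort the per-column indexes by length so the shorter categorical block comes first, then take each label in first-seen order.

```python
def ordered_union(indexes):
    """Merge several index label lists into one, block by block.

    Sorting by length puts the shorter (e.g. categorical) summary
    first; first-seen order then keeps each block contiguous instead
    of the lexsorted interleaving a plain concat would produce.
    """
    names = []
    for idx in sorted(indexes, key=len):
        for name in idx:
            if name not in names:
                names.append(name)
    return names
```

For a categorical block (count/unique/top/freq) and a numeric block (count/mean/.../max), the shared 'count' label appears once, at the top, and each block stays contiguous.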

@jreback reviewed pandas/core/generic.py (Outdated):
if self.ndim >= 3:
msg = "describe is not implemented on Panel or PanelND objects."
raise NotImplementedError(msg)

if (self.ndim > 1) and not (include is None and exclude is None):
if (include == 'all' or include == '*'):
if exclude != None:

@jreback (Contributor) commented Sep 5, 2014

I think adding all/* is reasonable for select_dtypes.

can you do a separate PR (it can be merged before this one), with tests?

@cpcloud (Member) commented Sep 5, 2014

Introducing this kind of coupling hardly seems worth it for the "inconvenience" of having to call a single method.

@cpcloud (Member) commented Sep 5, 2014

Also, how are all and star in select_dtypes different from just not calling the method?

@bthyreau (Contributor, Author) commented Sep 11, 2014

Ok, following the previous comment, I refrained from touching select_dtypes.

I left the row-creation logic as it was, as I believe it is necessary, as discussed before.

I'm quite happy with the current code. In addition to docstrings, I also added a brief overview to the main doc.

[I also have a short changelog entry ready, but I guess it's rude to commit it (e.g. in v0.15.txt) before knowing whether you plan to merge this at all :)]

@bthyreau bthyreau changed the title more flexible describe() + tests ENH: more flexible describe() + tests Sep 11, 2014

@jreback reviewed doc/source/basics.rst (Outdated):
.. ipython:: python

frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
frame.describe(include=['object'])

@jreback (Contributor) commented Sep 11, 2014

show frame.describe(), e.g. the default case

@jreback reviewed pandas/core/generic.py (Outdated):
To select all numeric types use numpy numpy.number. To select
categorical objects use type object. See also the select_dtypes
documentation. eg. df.describe(include=['O'])
- If both are None (default), the result will include only

@jreback (Contributor) commented Sep 11, 2014

put this first

@jreback reviewed pandas/core/generic.py (Outdated):
names.append(name)
d = pd.concat(ldesc, axis=1).loc[names]
return d

if percentile_width is not None and percentiles is not None:

@jreback (Contributor) commented Sep 11, 2014

Why are you not doing this, instead of all this looping (starting at 3687)?

In [7]:    df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
                        'catB': ['a', 'b', 'c', 'd'] * 6,
                        'numC': np.arange(24),
                        'numD': np.arange(24.) + .5,
                        'ts': tm.makeTimeSeries()[:24].index})

In [8]: df.apply(lambda x: x.describe())
Out[8]: 
       catA catB       numC       numD                   ts
25%     NaN  NaN   5.750000   6.250000                  NaN
50%     NaN  NaN  11.500000  12.000000                  NaN
75%     NaN  NaN  17.250000  17.750000                  NaN
count    24   24  24.000000  24.000000                   24
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00
freq     16    6        NaN        NaN                    1
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00
max     NaN  NaN  23.000000  23.500000                  NaN
mean    NaN  NaN  11.500000  12.000000                  NaN
min     NaN  NaN   0.000000   0.500000                  NaN
std     NaN  NaN   7.071068   7.071068                  NaN
top     foo    d        NaN        NaN  2000-01-31 00:00:00
unique    2    4        NaN        NaN                   24
@bthyreau (Contributor, Author) commented Sep 12, 2014

Ok, I updated the main doc and docstring following your request.
Thanks

As for the rationale for those loops: they are necessary to compute the order of the row axis (the statistics list). describe() must output results that are immediately practical for users, but without the loop, as you showed, the percentiles are not surrounded by min/max, count sits in the middle, etc., due to the default lexsorting logic of Index operations.

In detail, in the snippet below:

  • The first loop is basically the apply (except the concat'ing is delayed, so I still have access to each series' length)
  • The second loop is the index-sorting logic

That's why I gave up on using apply in this case. I also experimented with other ways, such as the various Index manipulation functions, or pre-computing the row keys, etc., but it didn't improve much.

Note also that, as a side effect, the whole function seems to be slightly faster than the logic-less apply alone.

def test1(fself, percentile_width=None, percentiles=[]):
    ldesc = []
    for name, col in fself.iteritems():
        s = col.describe(percentile_width=percentile_width,
                         percentiles=percentiles)
        s.name = name
        ldesc.append(s)
    # set a convenient order for rows
    names = []
    ldesc_indexes = sorted([x.index for x in ldesc], key=len)
    for idxnames in ldesc_indexes:
        for name in idxnames:
            if name not in names:
                names.append(name)
    d = pd.concat(ldesc, join_axes=[pd.Index(names)], axis=1)
    return d

In [84]: %timeit test1(df, percentiles=[.42])
100 loops, best of 3: 5.4 ms per loop

In [85]: %timeit df.apply(lambda x : x.describe(percentile_width = None, percentiles=[.42]))
100 loops, best of 3: 6.59 ms per loop

Same pattern on a wider (24, 500)-shaped df: 458 ms vs 499 ms

@jreback reviewed pandas/core/generic.py (Outdated):
# set a convenient order for rows
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
for idxnames in ldesc_indexes:

@jreback (Contributor) commented Sep 13, 2014

at the very least this should be a list comprehension then
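For reference, the name-collecting loop can indeed be collapsed into a single expression (a sketch only; `dict.fromkeys` relies on insertion-ordered dicts, i.e. Python 3.7+, which was not available at the time of this PR):

```python
def ordered_union_names(ldesc_indexes):
    """Flatten the length-sorted indexes and deduplicate labels,
    keeping first-seen order (shorter blocks come first)."""
    return list(dict.fromkeys(
        name
        for idx in sorted(ldesc_indexes, key=len)
        for name in idx))
```

The `dict.fromkeys` trick replaces the explicit `if name not in names` membership check, in O(n) instead of O(n²).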

@jreback reviewed pandas/core/generic.py (Outdated):
fself = self.select_dtypes(include=include, exclude=exclude)
# simply apply for each column in this case
ldesc = []
for name, col in fself.iteritems():

@jreback (Contributor) commented Sep 13, 2014

list comprehension here (over a function which does the action).

@bthyreau (Author, Contributor) commented Sep 14, 2014

That doesn't fit here, as this isn't a simple mapping/filtering loop (i.e. the comprehension's function call would have to be made stateful)

@bthyreau (Contributor, Author) commented Sep 14, 2014

Well, it's only style, but if you want the list comprehension back, then fine; while at it, to make an actual improvement, I changed the behaviour on Series so that the index name gets filled in at creation time. See commit.

@jreback (Contributor) commented Sep 14, 2014

pls squash to a single commit

@jreback reviewed pandas/core/generic.py (Outdated):
fself = self.select_dtypes(include=include, exclude=exclude)
ldesc = [col.describe(percentile_width=percentile_width,
percentiles=percentiles) for _, col in fself.iteritems()]
# set a convenient order for rows

@jreback (Contributor) commented Sep 14, 2014

blank line between blocks

@jreback reviewed pandas/tests/test_generic.py (Outdated):
def test_describe_typefiltering(self):
df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
'catB': ['a', 'b', 'c', 'd'] * 6,
'numC': np.arange(24),

@jreback (Contributor) commented Sep 14, 2014

use np.arange(24, dtype='int64'), otherwise these fail on 32-bit

@bthyreau (Author, Contributor) commented Sep 16, 2014

ok, complied;
(but FYI, I'm actually working on a 32-bit system, and nothing is failing here (since the types of the input and output are similar, and I avoided explicitly-sized dtypes in the tests). If you think something should have failed on my 32-bit system, please be more explicit so a test can be added.)

@@ -1012,6 +1012,85 @@ def test_describe_objects(self):
assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe())
assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe())

def test_describe_typefiltering(self):
df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,

@jreback (Contributor) commented Sep 14, 2014

add the issue number as a comment here

@bthyreau bthyreau force-pushed the bthyreau:betterdescribe branch Sep 16, 2014

@bthyreau (Contributor, Author) commented Sep 16, 2014

Ok, I squashed all commits into one, which updates the code and the main doc.
What I haven't committed yet is the changelog/release-notes text, which I have ready. Should I commit it somewhere?
Thanks

@@ -490,6 +490,23 @@ number of unique values and most frequently occurring values:
s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
s.describe()

Note that on a mixed-type DataFrame object, `describe` will restrict the summary to
include only numerical columns or, if none are present, only categorical columns:

@jreback (Contributor) commented Sep 16, 2014

pls add a similar note to v0.15.0.txt (and include this PR number as a reference). Put it in the API section, and include a reference to this doc section there.

@jreback jreback added this to the 0.15.0 milestone Sep 16, 2014

@bthyreau bthyreau force-pushed the bthyreau:betterdescribe branch Sep 17, 2014

frame.describe(include=['object'])
frame.describe(include=['number'])
frame.describe(include='all')

@jorisvandenbossche (Member) commented Sep 30, 2014

Can you refer to select_dtypes here? (as it is in that function that the include/exclude arguments are explained in more detail)

@jorisvandenbossche reviewed pandas/core/generic.py (Outdated):
If multiple values have the highest count, then the
`count` and `most common` pair will be arbitrarily chosen from
among those with the highest count.

The include, exclude arguments are ignored for Series.
"""

@jorisvandenbossche (Member) commented Sep 30, 2014

Maybe add a

See also
--------
DataFrame.select_dtypes

to the docstring to have an easy link to that function

@jorisvandenbossche reviewed pandas/core/generic.py (Outdated):
if self.ndim > 1:
if (include is None) and (exclude is None):

@jorisvandenbossche (Member) commented Sep 30, 2014

Is this needed here? As there is already a check for that above?

@bthyreau (Author, Contributor) commented Oct 1, 2014

Well, indeed, they're probably not necessary.
Actually, the whole function's codepaths were quite convoluted and crufty, and I wanted to refactor most of it, but I was reluctant to touch the existing code more than strictly necessary, for example breaking the (unspecified) order of fields in the output. However, I just noticed that there were inconsistencies about it anyway, so I'll propose the refactoring (see commit, and comment below). Sorry for making you review twice.

@jorisvandenbossche (Member) commented Sep 30, 2014

@bthyreau added some more comments (sorry it took a while to look at it).
Further, I have another remark: is the '*' option needed? Is just using 'all' not enough? (It's only two characters more.) I don't think * is used anywhere in pandas at the moment, and it's just two aliases for the same thing.

@cpcloud: are you ok with including this, as you objected in the first place?

@bthyreau bthyreau force-pushed the bthyreau:betterdescribe branch Oct 1, 2014

@bthyreau (Contributor, Author) commented Oct 1, 2014

  • Thanks for your feedback. Following your comments about the condition paths, I finally decided to refactor the whole function. I was initially reluctant to change it, as it alters some of the (unspecified) order of fields in the output; but it makes the code quite a bit smaller and much clearer now.
    The only user-visible change is that the order of the output rows is slightly changed (in a more consistent way, IMHO), and I had to alter an existing unittest which assumed an order.
  • Also, I wanted to confirm the behaviour with the new pd.Categorical data type, but I think select_dtypes() can't (explicitly) filter on it (yet?). Am I right?
@jreback (Contributor) commented Oct 1, 2014

select_dtypes handles categorical (pass in 'category' as the dtype).
Let's drop using * and just use all to include all dtypes.

@bthyreau (Contributor, Author) commented Oct 1, 2014

@jreback ok, thanks. Ok to drop "*" if you think it's inconsistent with the rest of pandas.
I'll fix the branch ASAP (it fails due to a minor discrepancy regarding the status of "bool", plus I want to assess categoricals) and push it again soon

@bthyreau (Contributor, Author) commented Oct 2, 2014

  • OK, removed '*' as you requested.
    [Of course, if it were only up to me, "all" would be the default describe() behaviour (since I can't figure out the motivation for silently dropping columns), so in real-world use I will have to alias it anyway.] I hope I made the doc clear enough to warn about this pitfall.

Thanks

@jreback

Contributor

commented Oct 2, 2014

@bthyreau well,

describe is most useful for numeric columns only, and that is the default,
so it provides backwards compatibility.

It does not drop random columns; rather, it selects numeric columns by default
(which is how most operations work, FYI).

@@ -3658,6 +3658,16 @@ def abs(self):
The percentiles to include in the output. Should all
be in the interval [0, 1]. By default `percentiles` is
[.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
include, exclude : list-like, 'all', or None (default)
Specify the form of the returned result. Either:

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

You have to leave an empty line after this line so that the list will be rendered as a list (in the online html docstring pages). The other option is to remove the line and just have the list items; then there does not need to be an empty line.

@jorisvandenbossche

pandas/core/generic.py Outdated
will include the count, unique, most common, and frequency of the
most common. Timestamps also include the first and last items.

For mixed dtypes, the index will be the union of the corresponding
output types. Non-applicable entries are filled with NaN.

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

This is still not correct, I think, as I commented before. Something like "Note that on a mixed-type DataFrame object, describe will restrict the summary to include only numerical columns or, if none are present, only categorical columns" (your sentence from above) should be included here as well.

@bthyreau

bthyreau Oct 2, 2014

Author Contributor

I added a sentence at the top of the paragraph (hard to spot in the commit, I know); did you see it, and do you think it's not enough?
(Basically, the whole paragraph now refers to output dtypes, since it's configurable.)
Thanks !

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

Do you mean the first sentence in the Notes part, 'The output DataFrame index depends on the requested dtypes:'?

Further, it is already stated in the explanation of the include/exclude keywords in the Parameters section, but I don't think it is a problem to repeat it here; better to be very explicit about it. The way it is now phrased is just not correct in the default case, only when you specify something with include/exclude.

@bthyreau

bthyreau Oct 2, 2014

Author Contributor

Yes, that part. I thought the docstring as a whole made that clear. OK, I'll update it to make it more explicit. Thanks

@jorisvandenbossche

pandas/core/generic.py Outdated
if name not in names:
names.append(name)
d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
return d

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

You are already returning here, while there are still other important parts of the function below (e.g. the checking and processing of the percentile_width and percentiles keywords). That does not seem right (and it is strange this does not fail on Travis, or it is simply not tested, of course).

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

But of course it calls the rest of the function when calling describe on the individual columns. Still, I think you can put the percentiles processing part before this, so the conversion of percentile_width to percentiles has to happen only once.

@bthyreau

bthyreau Oct 2, 2014

Author Contributor

That's because on a DataFrame, it simply calls itself recursively on each of its Series. I just moved all the type-selection logic to a single place at the beginning (thanks to the existence of select_dtypes).

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

Yep, I figured that out in the meantime :-)
But still, the ifs for percentile_width and percentiles can be put above that, no? In that case, for a DataFrame, the conversion of percentile_width to percentiles only has to happen once.

@jorisvandenbossche

pandas/core/generic.py Outdated
result += [lib.Timestamp(asint.min()),
lib.Timestamp(asint.max()),
lib.Timestamp(top), freq]
names += ['top', 'freq', 'first', 'last']

@jorisvandenbossche

jorisvandenbossche Oct 2, 2014

Member

Why are you changing the order here?

@bthyreau

bthyreau Oct 2, 2014

Author Contributor

This is originally a side effect of the refactoring (the way I sort out the index in the general case). It changed the previously generated index order, but had the advantage of keeping the order the same whatever the types (e.g. no more "# when mix of datetime / obj the index gets reordered"), and consistent with the docstring, which I saw as an advantage. It was my feeling that the previous order was arbitrary, so I decided to make the change as is.

If you think preserving this row order is necessary, I guess I could add some additional logic for that case.

@bthyreau bthyreau force-pushed the bthyreau:betterdescribe branch Oct 3, 2014

@bthyreau

Contributor Author

commented Oct 3, 2014

OK, refactored a bit to avoid recomputing the parameters due to recursion; thanks for pointing it out. As a bonus, the code paths are shorter and easier to follow!
I also updated the docstring following your suggestion to be more explicit about the mixed-type cases.

Thanks !

@jreback

pandas/core/generic.py Outdated
@@ -3751,42 +3767,45 @@ def describe_categorical_1d(data):

elif issubclass(data.dtype.type, np.datetime64):
asint = data.dropna().values.view('i8')
names += ['first', 'last', 'top', 'freq']

@jreback

jreback Oct 4, 2014

Contributor

I suspect this does not work with Timedelta, but easy to fix, do something like:

if com.needs_i8_conversion(data):
    boxer = com.i8_boxer(data)
    asint = data.dropna().asi8

    names = ......

then just use boxer rather than lib.Timestamp

@bthyreau

bthyreau Oct 5, 2014

Author Contributor

Thanks for pointing that out.

Hmm, although your snippet works well, Timedelta seems closer to a numeric type than an object one: select_dtypes() filters it with np.number, and users would want min/max/mean values, not first/last. So instead, I moved it to the describe_numeric code path. Since it's new in 0.15, I think it's OK to do so even in the "default" describe call. However, std() doesn't make a lot of sense there, so an alternative would be to create another specific index for timedelta.

i.e. the current code behaves as follows:

In [4]: df = DataFrame({'A_cat': Categorical(['foo', 'foo', 'bar'] * 8),
   ...:                 'B_obj': ['a', 'b', 'c', 'd'] * 6,
   ...:                 'C_int': np.arange(24, dtype='int64'),
   ...:                 'D_ts': tm.makeTimeSeries(24).index,
   ...:                 'E_tdelta': to_timedelta(np.arange(24)%20,"D")})

In [6]: df.describe(include="all")
Out[6]: 
       A_cat B_obj      C_int                 D_ts          E_tdelta
count     24    24  24.000000                   24                24
unique     2     4        NaN                   24               NaN
top      foo     d        NaN  2000-01-31 00:00:00               NaN
freq      16     6        NaN                    1               NaN
first    NaN   NaN        NaN  2000-01-03 00:00:00               NaN
last     NaN   NaN        NaN  2000-02-03 00:00:00               NaN
mean     NaN   NaN  11.500000                  NaN   8 days 04:00:00
std      NaN   NaN   7.071068                  NaN      5.354416e+14
min      NaN   NaN   0.000000                  NaN   0 days 00:00:00
25%      NaN   NaN   5.750000                  NaN   2 days 18:00:00
50%      NaN   NaN  11.500000                  NaN   7 days 12:00:00
75%      NaN   NaN  17.250000                  NaN  13 days 06:00:00
max      NaN   NaN  23.000000                  NaN  19 days 00:00:00

In [7]: df.describe()
Out[7]: 
           C_int          E_tdelta
count  24.000000                24
mean   11.500000   8 days 04:00:00
std     7.071068      5.354416e+14
min     0.000000   0 days 00:00:00
25%     5.750000   2 days 18:00:00
50%    11.500000   7 days 12:00:00
75%    17.250000  13 days 06:00:00
max    23.000000  19 days 00:00:00

@jreback

jreback Oct 5, 2014

Contributor

This is fine (I am going to fix the std of the Timedelta right now, actually). Go ahead and make those slight changes (below) and we'll get this in.

@jreback

jreback Oct 5, 2014

Contributor

On second thought, even though std IS possible, it's too tricky for right now (the problem is that var is not allowed because it CAN overflow and not be representable as a Timedelta, so std needs special handling; punting for now). Putting a wrapper around these ops to catch a TypeError and turn it into a NaN is the best solution, I think.

@bthyreau

bthyreau Oct 5, 2014

Author Contributor

Hmm; in the example above, the float64 value (5.354416e+14) for std() is actually correct, and converts correctly with to_timedelta(df.std()) to 6 days 04:44:01.629993. [The var(), OTOH, throws an OverflowError, but describe() doesn't use var().]

I'm not sure why the returned std value is not a timedelta64 but a float64; it looks like a problem with the std() call. Did you mean I should still catch that in describe() and recover? (I may have misunderstood, since I'm not familiar with that part.)

I fixed your other comment - Thanks !

@jreback

pandas/core/generic.py Outdated
def describe_1d(data, percentiles):
if data._is_numeric_mixed_type:
return describe_numeric_1d(data, percentiles)
elif issubclass(data.dtype.type, np.timedelta64):

@jreback

jreback Oct 5, 2014

Contributor

use com.is_timedelta64_dtype here (and com.is_datetime64_dtype above (rather than the sub-class check which is older).

[series.quantile(x) for x in percentiles] +
[series.max()])
stat_index = (['count', 'mean', 'std', 'min'] +
[pretty_name(x) for x in percentiles] + ['max'])

@jreback

jreback Oct 5, 2014

Contributor

.std() will now raise TypeError on timedelta64[ns] (after you rebase to master). So wrap the .std() call in something like:

def std(x):
    try:
        return x.std()
    except TypeError:
        return np.nan
@jreback

Contributor

commented Oct 5, 2014

@bthyreau small change.

Some .std() calls on a Timedelta will work (except for the returned inference). But that's the problem: some will also overflow. This stems from std = sqrt(var), and var can be a really big number.

In [7]: max_int = np.iinfo(np.int64).max

In [8]: max_int
Out[8]: 9223372036854775807

In [9]: pd.Timedelta(max_int)
Out[9]: Timedelta('106751 days 23:47:16.854775')

In [10]: big_float = 3e19

In [12]: pd.Timedelta(big_float)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
OverflowError: Python int too large to convert to C long

In [13]: pd.Timedelta(np.sqrt(big_float))
Out[13]: Timedelta('0 days 00:00:05.477225')

Going to create an issue to fix this, but don't have time right now.
So just wrap the .std() as I show above to catch this (it will raise TypeError because it doesn't allow std/var ATM).

The complication is that std is allowed, but not var (yet std CALLS var). So this needs to be done in a non-hacky way.

@jreback

Contributor

commented Oct 5, 2014

see here: #8471

lmk when you make that change and push.

@jreback

Contributor

commented Oct 5, 2014

@bthyreau if you can address this soon that would be great;
if not, let me know. Going to do an RC probably tonight, and this goes in.

@jreback

Contributor

commented Oct 5, 2014

I think #8476 will allow this to merge cleanly. so hold off

@jreback

Contributor

commented Oct 5, 2014

@bthyreau OK, I think if you rebase this should work

@bthyreau

Contributor Author

commented Oct 5, 2014

ok great. Rebasing and pushing now

More flexible describe() via include/exclude type filtering
This enhances describe()'s output via new include/exclude list arguments,
letting the user specify the dtypes to be summarized in the output.
This provides a simple way to override the automatic type filtering done
by default; it's also convenient with groupby().
Also includes documentation and changelog entries.
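A brief usage sketch of the merged include/exclude keywords (column names are made up; behaviour as described in this thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "obj": ["a", "b", "c", "d"] * 6,
    "num": np.arange(24, dtype="int64"),
})

numeric_only = df.describe()                    # default: numeric columns only
objects_only = df.describe(include=["object"])  # summarize object columns only
everything = df.describe(include="all")         # union of all summaries
```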

@bthyreau bthyreau force-pushed the bthyreau:betterdescribe branch to c2a1e18 Oct 5, 2014

@jreback

Contributor

commented Oct 6, 2014

merge via 6d3803d

thanks!

@jreback

Contributor

commented Oct 6, 2014

side issue:

I think we may need a rounding option or something to make some of the default
display better looking. This comes up most often in timedeltas, but maybe other dtypes too.

This example is from your tests.

In [23]: df['td'] = df['ts'].diff()

In [24]: df.describe(include='all')
Out[24]: 
       catA catB       numC       numD                   ts                      td
count    24   24  24.000000  24.000000                   24                      23
unique    2    4        NaN        NaN                   24                     NaN
top     foo    d        NaN        NaN  2000-01-31 00:00:00                     NaN
freq     16    6        NaN        NaN                    1                     NaN
first   NaN  NaN        NaN        NaN  2000-01-03 00:00:00                     NaN
last    NaN  NaN        NaN        NaN  2000-02-03 00:00:00                     NaN
mean    NaN  NaN  11.500000  12.000000                  NaN  1 days 08:20:52.173913
std     NaN  NaN   7.071068   7.071068                  NaN  0 days 18:36:09.225425
min     NaN  NaN   0.000000   0.500000                  NaN         1 days 00:00:00
25%     NaN  NaN   5.750000   6.250000                  NaN         1 days 00:00:00
50%     NaN  NaN  11.500000  12.000000                  NaN         1 days 00:00:00
75%     NaN  NaN  17.250000  17.750000                  NaN         1 days 00:00:00
max     NaN  NaN  23.000000  23.500000                  NaN         3 days 00:00:00

In [25]: df['td'].describe(include='all')
Out[25]: 
count                        23
mean     1 days 08:20:52.173913
std      0 days 18:36:09.225425
min             1 days 00:00:00
25%             1 days 00:00:00
50%             1 days 00:00:00
75%             1 days 00:00:00
max             3 days 00:00:00
Name: td, dtype: object

In [26]: df['td'].describe(include='all')['mean']
Out[26]: Timedelta('1 days 08:20:52.173913')

You can 'fix' this by rounding (and you can check Timedelta(...).resolution to make sure that you are not cutting things off; e.g. Timedelta('1min 1s').resolution -> 's').

In [27]: df['td'].describe(include='all')['mean'].round('s')
Out[27]: Timedelta('1 days 08:20:52')

So we probably need a wrapper for various functions (e.g. mean/std) to do this (for numeric-like dtypes),
with an option.
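One possible shape for such a wrapper (a hypothetical helper; rounded_stat is not a pandas API, it just leans on Timedelta.round as shown above):

```python
import pandas as pd

def rounded_stat(series, func, freq="s"):
    # Hypothetical helper: apply a reduction and, when the result is a
    # Timedelta, round it for nicer display in describe()-style output.
    result = func(series)
    if isinstance(result, pd.Timedelta):
        return result.round(freq)
    return result

td = pd.Series(pd.to_timedelta(["1 days 00:00:00.25", "2 days"]))
mean_td = rounded_stat(td, pd.Series.mean)      # sub-second noise rounded away
mean_num = rounded_stat(pd.Series([1, 2, 3]), pd.Series.mean)  # passed through
```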

@bthyreau If you think this is worthwhile, pls create a new issue.

@jorisvandenbossche
@cpcloud

@jreback jreback closed this Oct 6, 2014
