ENH: more flexible describe() + tests #8164
Conversation
|
pls show examples of the use case for this |
Rationale. When using a DataFrame of mixed types, i.e. containing numeric, string, categorical, etc. columns, the current behaviour of describe() is a bit rough: it summarizes only the numerical columns or, if none exist, only the categorical columns. With this change, describe() becomes more flexible in its return form, which considerably smoothed my interactive data-analysis sessions. From the doc
Example. Although real-life scenarios are more convincing, here is a small example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
...: 'colB': ['a', 'b', 'c', 'd', 'e'] * 6,
...: 'colC': np.arange(30), 'colD' : np.ones(30)})
In [4]: df.head()
Out[4]:
colA colB colC colD
0 foo a 0 1
1 foo b 1 1
2 bar c 2 1
3 foo d 3 1
4 foo e 4 1
# old behaviour picks columns based on the dtypes of the dataframe. Not so nice.
In [6]: df.describe()
Out[6]:
colC colD
count 30.000000 30
mean 14.500000 1
std 8.803408 0
min 0.000000 1
25% 7.250000 1
50% 14.500000 1
75% 21.750000 1
max 29.000000 1
# using the new option, we can explicitly ask to describe both types
In [8]: df.describe(return_type="categorical_only")
Out[8]:
colA colB
count 30 30
unique 2 5
top foo d
freq 20 6
In [9]: df.describe(return_type="numeric_only")
Out[9]:
colC colD
count 30.000000 30
mean 14.500000 1
std 8.803408 0
min 0.000000 1
25% 7.250000 1
50% 14.500000 1
75% 21.750000 1
max 29.000000 1
# using option "same" returns a df with the same columns as the input
In [11]: df.describe(return_type="same")
Out[11]:
colA colB colC colD
count 30 30 30 30
unique 2 5 NaN NaN
top foo d NaN NaN
freq 20 6 NaN NaN
mean NaN NaN 14.5 1
std NaN NaN 8.803408 0
min NaN NaN 0 1
25% NaN NaN 7.25 1
50% NaN NaN 14.5 1
75% NaN NaN 21.75 1
max NaN NaN 29 1
# one of my favorite patterns, using groupby:
In [13]: out = df.groupby("colA").describe(return_type="same")
In [14]: out.unstack(0)
colB colC colD
colA bar foo bar foo bar foo
count 10 20 10 20 10 20
unique 5 5 NaN NaN NaN NaN
top d d NaN NaN NaN NaN
freq 2 4 NaN NaN NaN NaN
mean NaN NaN 15.5 14 1 1
std NaN NaN 9.082951 8.855566 0 0
min NaN NaN 2 0 1 1
25% NaN NaN 8.75 6.75 1 1
50% NaN NaN 15.5 14 1 1
75% NaN NaN 22.25 21.25 1 1
max NaN NaN 29 28 1 1 |
|
Is there a reason you think that the above approach is better than:
(possibly adding |
jreback
added API Design Dtypes
labels
Sep 3, 2014
In [60]: df.dtypes
Out[60]:
colA object
colB object
colC int32
colD float64
dtype: object
In [56]: model_col = ["colA","colB"]
In [57]: df.loc[:,model_col].describe().loc[:,model_col]
Out[57]:
colA colB
count 30 30
unique 2 5
top foo d
freq 20 6
In [58]: model_col = ["colA","colB","colC"]
In [59]: df.loc[:,model_col].describe().loc[:,model_col]
Out[59]:
colA colB colC
count NaN NaN 30.000000
mean NaN NaN 14.500000
std NaN NaN 8.803408
min NaN NaN 0.000000
25% NaN NaN 7.250000
50% NaN NaN 14.500000
75% NaN NaN 21.750000
max NaN NaN 29.000000
Here we have lost the count(), unique(), first(), etc. of colA and colB as soon as we introduced colC in the model. However, nothing is lost anymore when using return_type="same":
In [61]: df.loc[:,model_col].describe(return_type="same").loc[:,model_col]
Out[61]:
colA colB colC
count 30 30 30
unique 2 5 NaN
top foo d NaN
freq 20 6 NaN
mean NaN NaN 14.5
std NaN NaN 8.803408
min NaN NaN 0
25% NaN NaN 7.25
50% NaN NaN 14.5
75% NaN NaN 21.75
max NaN NaN 29
Of course, it's more convincing with real-world large DataFrames of mixed types (as used e.g. in psychology), where it's easy to mentally lose track of every column and its type.
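The manual alternative discussed above (selecting by dtype, describing each group, and aligning the pieces) can be sketched as follows. This is a minimal sketch, assuming pandas >= 0.14.1 for `select_dtypes`; the row order of the concatenated result may differ from what `describe(return_type="same")` would produce.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
                   'colB': ['a', 'b', 'c', 'd', 'e'] * 6,
                   'colC': np.arange(30), 'colD': np.ones(30)})

# Describe each dtype group separately...
num = df.select_dtypes(include=[np.number]).describe()
cat = df.select_dtypes(include=['object']).describe()

# ...then outer-concat along columns; statistics missing from one piece
# become NaN, mirroring the "same" behaviour (modulo row order).
combined = pd.concat([cat, num], axis=1)
```

The point of the patch is precisely to fold this three-step dance into a single `describe()` call.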
|
|
ok, your idea of 'same' is ok, but the API is not consistent with the pandas style. I would be ok with adding it if it's a well-constructed and general API, as suggested by @cpcloud |
|
ok, i implemented your suggested API, and it's indeed more flexible while retaining the usability. Great! Now it's possible to specify the output form using include=/exclude= lists. Some snippets below:
>>> from pandas import Series
>>> from pandas import DataFrame
>>> import pandas.util.testing as tm
>>> import numpy as np
>>>
>>> df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8,
... 'catB': ['a', 'b', 'c', 'd'] * 6,
... 'numC': np.arange(24),
... 'numD': np.arange(24.) + .5,
... 'ts': tm.makeTimeSeries()[:24].index})
>>>
>>>
>>> df.describe(include=["number","object"])
catA catB numC numD
count 24 24 24.000000 24.000000
unique 2 4 NaN NaN
top foo d NaN NaN
freq 16 6 NaN NaN
mean NaN NaN 11.500000 12.000000
std NaN NaN 7.071068 7.071068
min NaN NaN 0.000000 0.500000
25% NaN NaN 5.750000 6.250000
50% NaN NaN 11.500000 12.000000
75% NaN NaN 17.250000 17.750000
max NaN NaN 23.000000 23.500000
>>> df.loc[:,:].describe() # as before
numC numD
count 24.000000 24.000000
mean 11.500000 12.000000
std 7.071068 7.071068
min 0.000000 0.500000
25% 5.750000 6.250000
50% 11.500000 12.000000
75% 17.250000 17.750000
max 23.000000 23.500000
>>>
>>> df.loc[:,['catA','catB','ts']].describe() # contains NaN, as before
catA catB ts
count 24 24 24
unique 2 4 24
first NaN NaN 2000-01-03 00:00:00
last NaN NaN 2000-02-03 00:00:00
top foo d 2000-01-31 00:00:00
freq 16 6 1
>>>
>>> df.describe(include=["object"])
catA catB
count 24 24
unique 2 4
top foo d
freq 16 6
>>> df.describe(include='*')
catA catB numC numD ts
count 24 24 24.000000 24.000000 24
unique 2 4 NaN NaN 24
top foo d NaN NaN 2000-01-31 00:00:00
freq 16 6 NaN NaN 1
first NaN NaN NaN NaN 2000-01-03 00:00:00
last NaN NaN NaN NaN 2000-02-03 00:00:00
mean NaN NaN 11.500000 12.000000 NaN
std NaN NaN 7.071068 7.071068 NaN
min NaN NaN 0.000000 0.500000 NaN
25% NaN NaN 5.750000 6.250000 NaN
50% NaN NaN 11.500000 12.000000 NaN
75% NaN NaN 17.250000 17.750000 NaN
max NaN NaN 23.000000 23.500000 NaN
>>>
>>> df.loc[:,['catA','catB']].describe(include='*')
catA catB
count 24 24
unique 2 4
top foo d
freq 16 6
>>> df.describe(include='*', exclude='XXX')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/generic.py", line 3681, in describe
raise ValueError("exclude must be None when include is '%s'" % include)
ValueError: exclude must be None when include is '*'
>>>
>>> df.groupby("catA").describe(include='*') # my favorite
catB numC numD ts
catA
bar count 8 8.000000 8.000000 8
unique 4 NaN NaN 8
top d NaN NaN 2000-01-31 00:00:00
freq 2 NaN NaN 1
first NaN NaN NaN 2000-01-05 00:00:00
last NaN NaN NaN 2000-02-03 00:00:00
mean NaN 12.500000 13.000000 NaN
std NaN 7.348469 7.348469 NaN
min NaN 2.000000 2.500000 NaN
25% NaN 7.250000 7.750000 NaN
50% NaN 12.500000 13.000000 NaN
75% NaN 17.750000 18.250000 NaN
max NaN 23.000000 23.500000 NaN
foo count 16 16.000000 16.000000 16
unique 4 NaN NaN 16
top d NaN NaN 2000-01-25 00:00:00
freq 4 NaN NaN 1
first NaN NaN NaN 2000-01-03 00:00:00
last NaN NaN NaN 2000-02-02 00:00:00
mean NaN 11.000000 11.500000 NaN
std NaN 7.118052 7.118052 NaN
min NaN 0.000000 0.500000 NaN
25% NaN 5.500000 6.000000 NaN
50% NaN 11.000000 11.500000 NaN
75% NaN 16.500000 17.000000 NaN
max NaN 22.000000 22.500000 NaN
>>> df.groupby("catA").describe(include=["object", "datetime", "number"], exclude=["float"])
catB numC ts
catA
bar count 8 8.000000 8
unique 4 NaN 8
top d NaN 2000-01-31 00:00:00
freq 2 NaN 1
first NaN NaN 2000-01-05 00:00:00
last NaN NaN 2000-02-03 00:00:00
mean NaN 12.500000 NaN
std NaN 7.348469 NaN
min NaN 2.000000 NaN
25% NaN 7.250000 NaN
50% NaN 12.500000 NaN
75% NaN 17.750000 NaN
max NaN 23.000000 NaN
foo count 16 16.000000 16
unique 4 NaN 16
top d NaN 2000-01-25 00:00:00
freq 4 NaN 1
first NaN NaN 2000-01-03 00:00:00
last NaN NaN 2000-02-02 00:00:00
mean NaN 11.000000 NaN
std NaN 7.118052 NaN
min NaN 0.000000 NaN
25% NaN 5.500000 NaN
50% NaN 11.000000 NaN
75% NaN 16.500000 NaN
max NaN 22.000000 NaN
Some minor design-decision points
|
jreback
and 1 other
commented on an outdated diff
Sep 5, 2014
| if self.ndim >= 3: | ||
| msg = "describe is not implemented on on Panel or PanelND objects." | ||
| raise NotImplementedError(msg) | ||
| + if (self.ndim > 1) and not (include is None and exclude is None): | ||
| + if (include == 'all' or include == '*'): | ||
| + if exclude != None: | ||
| + raise ValueError("exclude must be None when include is '%s'" % include) | ||
| + fself = self | ||
| + else: | ||
| + fself = self.select_dtypes(include=include, exclude=exclude) | ||
| + # simply apply for each column in this case | ||
| + ldesc = [fself[x].describe(percentile_width=percentile_width,\ | ||
| + percentiles=percentiles) \ | ||
| + for x in fself.columns] | ||
| + # merge individual outputs, preserving index order as possible | ||
| + names = [] | ||
| + ldesc_indexes = sorted([x.index for x in ldesc], key=len) |
jreback
Contributor
|
jreback
commented on an outdated diff
Sep 5, 2014
| if self.ndim >= 3: | ||
| msg = "describe is not implemented on on Panel or PanelND objects." | ||
| raise NotImplementedError(msg) | ||
| + if (self.ndim > 1) and not (include is None and exclude is None): | ||
| + if (include == 'all' or include == '*'): | ||
| + if exclude != None: |
jreback
Contributor
|
|
Introducing this kind of coupling hardly seems worth it for the "inconvenience" of having to call a single method. |
|
Also, how are 'all' and '*' in select_dtypes different from just not calling the method? |
|
Ok, following the previous comment, i refrained from touching select_dtypes. I left the row-creation logic as it was, as i believed it was necessary, as discussed before. I'm quite happy with the current code. In addition to docstrings, I also added a brief overview in the main doc. [I also have a potential short changelog entry, but i guess it's rude to commit it (e.g. in v0.15.txt) before knowing if you plan to merge this at all :)] |
bthyreau
changed the title from
more flexible describe() + tests to ENH: more flexible describe() + tests
Sep 11, 2014
jreback
commented on an outdated diff
Sep 11, 2014
| @@ -490,6 +490,19 @@ number of unique values and most frequently occurring values: | ||
| s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a']) | ||
| s.describe() | ||
| +Note that on a mixed-type DataFrame object, `describe` will restrict the summary to | ||
| +include only numerical columns or, if none are, only categorical columns. | ||
| +This behaviour can be refined via the ``include``/``exclude`` | ||
| +arguments. The special value ``all`` or ``*`` can also be used: | ||
| + | ||
| + | ||
| +.. ipython:: python | ||
| + | ||
| + frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)}) | ||
| + frame.describe(include=['object']) |
|
|
jreback
commented on an outdated diff
Sep 11, 2014
| @@ -3635,6 +3635,16 @@ def abs(self): | ||
| The percentiles to include in the output. Should all | ||
| be in the interval [0, 1]. By default `percentiles` is | ||
| [.25, .5, .75], returning the 25th, 50th, and 75th percentiles. | ||
| + include, exclude : list-like, 'all', or None (default) | ||
| + Specify the form of the returned result. Either: | ||
| + - A list of dtypes or strings to be included/excluded | ||
| + To select all numeric types use numpy numpy.number. To select | ||
| + categorical objects use type object. See also the select_dtypes | ||
| + documentation. eg. df.describe(include=['O']) | ||
| + - If both are None (default), the result will include only |
|
|
jreback
commented on an outdated diff
Sep 11, 2014
| + ldesc = [] | ||
| + for name, col in fself.iteritems(): | ||
| + s = col.describe(percentile_width=percentile_width,\ | ||
| + percentiles=percentiles) | ||
| + s.name = name | ||
| + ldesc.append(s) | ||
| + # merge individual outputs, preserving index order as possible | ||
| + names = [] | ||
| + ldesc_indexes = sorted([x.index for x in ldesc], key=len) | ||
| + for idxnames in ldesc_indexes: | ||
| + for name in idxnames: | ||
| + if name not in names: | ||
| + names.append(name) | ||
| + d = pd.concat(ldesc, axis=1).loc[names] | ||
| + return d | ||
| + | ||
| if percentile_width is not None and percentiles is not None: |
jreback
Contributor
|
|
Ok, i updated the main doc and docstring following your request. As for the rationale of those loops: they are needed to compute the order of the row axis (the list of statistics). describe() must output results that are immediately practical for users; without the loop, as you showed, the percentiles are not surrounded by min/max, count sits in the middle, etc., due to the default lexsorting logic of Index operations. In detail, see the snippet below.
That's why i gave up using apply in this case. I also experimented with other ways, such as the various Index manipulation functions, or pre-computing the row keys, but they didn't improve much. Note also that, as a side effect, the whole function itself seems to be slightly faster than the bare apply. def test1(fself, percentile_width = None, percentiles = []):
ldesc = []
for name, col in fself.iteritems():
s = col.describe(percentile_width=percentile_width,\
percentiles=percentiles)
s.name = name
ldesc.append(s)
# set a convenient order for rows
names = []
ldesc_indexes = sorted([x.index for x in ldesc], key=len)
for idxnames in ldesc_indexes:
for name in idxnames:
if name not in names:
names.append(name)
d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
return d
In [84]: %timeit test1(df, percentiles=[.42])
100 loops, best of 3: 5.4 ms per loop
In [85]: %timeit df.apply(lambda x : x.describe(percentile_width = None, percentiles=[.42]))
100 loops, best of 3: 6.59 ms per loop
Same pattern on a wider (24, 500)-shaped df: 458 ms vs 499 ms |
jreback
commented on an outdated diff
Sep 13, 2014
| + if exclude != None: | ||
| + raise ValueError("exclude must be None when include is '%s'" % include) | ||
| + fself = self | ||
| + else: | ||
| + fself = self.select_dtypes(include=include, exclude=exclude) | ||
| + # simply apply for each column in this case | ||
| + ldesc = [] | ||
| + for name, col in fself.iteritems(): | ||
| + s = col.describe(percentile_width=percentile_width,\ | ||
| + percentiles=percentiles) | ||
| + s.name = name | ||
| + ldesc.append(s) | ||
| + # set a convenient order for rows | ||
| + names = [] | ||
| + ldesc_indexes = sorted([x.index for x in ldesc], key=len) | ||
| + for idxnames in ldesc_indexes: |
|
|
jreback
and 1 other
commented on an outdated diff
Sep 13, 2014
| if self.ndim >= 3: | ||
| msg = "describe is not implemented on on Panel or PanelND objects." | ||
| raise NotImplementedError(msg) | ||
| + if (self.ndim > 1) and not (include is None and exclude is None): | ||
| + if (include == 'all' or include == '*'): | ||
| + if exclude != None: | ||
| + raise ValueError("exclude must be None when include is '%s'" % include) | ||
| + fself = self | ||
| + else: | ||
| + fself = self.select_dtypes(include=include, exclude=exclude) | ||
| + # simply apply for each column in this case | ||
| + ldesc = [] | ||
| + for name, col in fself.iteritems(): |
bthyreau
Contributor
|
|
Well, it's only style, but if you want the list comprehension back, then fine; while at it, to make some actual improvement, i changed the behaviour on Series so that the index name gets filled at creation time. See commit. |
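The change described here — setting the Series name at construction rather than patching it on afterwards — can be sketched like this (a minimal illustration, not the actual pandas internals):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3], name='numC')

# Before: build the summary, then patch the name on afterwards.
before = pd.Series([s.count(), s.nunique()], index=['count', 'unique'])
before.name = s.name

# After: pass the name at construction time -- one step instead of two,
# and the name cannot be forgotten on any code path.
after = pd.Series([s.count(), s.nunique()],
                  index=['count', 'unique'], name=s.name)
```

With the name already set, the per-column results can be concatenated directly into a DataFrame whose columns carry the right labels.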
|
pls squash to a single commit |
jreback
commented on an outdated diff
Sep 14, 2014
| if self.ndim >= 3: | ||
| msg = "describe is not implemented on on Panel or PanelND objects." | ||
| raise NotImplementedError(msg) | ||
| + if (self.ndim > 1) and not (include is None and exclude is None): | ||
| + if (include == 'all' or include == '*'): | ||
| + if exclude != None: | ||
| + raise ValueError("exclude must be None when include is '%s'" % include) | ||
| + fself = self | ||
| + else: | ||
| + fself = self.select_dtypes(include=include, exclude=exclude) | ||
| + ldesc = [col.describe(percentile_width=percentile_width, | ||
| + percentiles=percentiles) for _, col in fself.iteritems()] | ||
| + # set a convenient order for rows |
|
|
jreback
and 1 other
commented on an outdated diff
Sep 14, 2014
| @@ -1012,6 +1012,85 @@ def test_describe_objects(self): | ||
| assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe()) | ||
| assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe()) | ||
| + def test_describe_typefiltering(self): | ||
| + df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, | ||
| + 'catB': ['a', 'b', 'c', 'd'] * 6, | ||
| + 'numC': np.arange(24), |
bthyreau
Contributor
|
jreback
commented on the diff
Sep 14, 2014
| @@ -1012,6 +1012,85 @@ def test_describe_objects(self): | ||
| assert_frame_equal(df[['C1', 'C3']].describe(), df[['C3']].describe()) | ||
| assert_frame_equal(df[['C2', 'C3']].describe(), df[['C3']].describe()) | ||
| + def test_describe_typefiltering(self): | ||
| + df = DataFrame({'catA': ['foo', 'foo', 'bar'] * 8, |
|
|
|
Ok, i squashed all commits into one, which updates the code and the main doc. |
jreback
commented on the diff
Sep 16, 2014
| @@ -490,6 +490,23 @@ number of unique values and most frequently occurring values: | ||
| s = Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a']) | ||
| s.describe() | ||
| +Note that on a mixed-type DataFrame object, `describe` will restrict the summary to | ||
| +include only numerical columns or, if none are, only categorical columns: |
jreback
Contributor
|
jreback
added this to the
0.15.0
milestone
Sep 16, 2014
|
Ok, done & rebased. |
|
@jorisvandenbossche @cpcloud have a look pls |
|
@cpcloud can you review |
jorisvandenbossche
and 1 other
commented on an outdated diff
Sep 30, 2014
| will include the count, unique, most common, and frequency of the | ||
| most common. Timestamps also include the first and last items. | ||
| + For mixed dtypes, the index includes the union of the corresponding | ||
| + output types. Non-applicable entries are filled with NaN. |
jorisvandenbossche
Owner
|
jorisvandenbossche
commented on the diff
Sep 30, 2014
| +include only numerical columns or, if none are, only categorical columns: | ||
| + | ||
| +.. ipython:: python | ||
| + | ||
| + frame = DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)}) | ||
| + frame.describe() | ||
| + | ||
| +This behaviour can be controlled by providing a list of types as ``include``/``exclude`` | ||
| +arguments. The special value ``all`` (or ``*``) can also be used: | ||
| + | ||
| +.. ipython:: python | ||
| + | ||
| + frame.describe(include=['object']) | ||
| + frame.describe(include=['number']) | ||
| + frame.describe(include='all') | ||
| + |
jorisvandenbossche
Owner
|
jorisvandenbossche
commented on an outdated diff
Sep 30, 2014
| If multiple values have the highest count, then the | ||
| `count` and `most common` pair will be arbitrarily chosen from | ||
| among those with the highest count. | ||
| + | ||
| + The include, exclude arguments are ignored for Series. | ||
| """ |
jorisvandenbossche
Owner
|
jorisvandenbossche
and 1 other
commented on an outdated diff
Sep 30, 2014
| if self.ndim > 1: | ||
| + if (include is None) and (exclude is None): |
jorisvandenbossche
Owner
|
|
@bthyreau added some more comments (sorry it took a while to look at). @cpcloud: are you ok with including this, since you objected in the first place? |
|
|
select_dtypes handles categorical (pass in 'category' as the dtype) |
|
@jreback ok thanks. Ok to drop "*" if you think it's inconsistent with the rest of pandas. |
Thanks |
|
@bthyreau well, describe is most useful for numeric columns only, and that is the default; it does not drop random columns, rather it selects numeric by default |
jorisvandenbossche
commented on the diff
Oct 2, 2014
| @@ -3658,6 +3658,16 @@ def abs(self): | ||
| The percentiles to include in the output. Should all | ||
| be in the interval [0, 1]. By default `percentiles` is | ||
| [.25, .5, .75], returning the 25th, 50th, and 75th percentiles. | ||
| + include, exclude : list-like, 'all', or None (default) | ||
| + Specify the form of the returned result. Either: |
jorisvandenbossche
Owner
|
jorisvandenbossche
and 1 other
commented on an outdated diff
Oct 2, 2014
| will include the count, unique, most common, and frequency of the | ||
| most common. Timestamps also include the first and last items. | ||
| + For mixed dtypes, the index will be the union of the corresponding | ||
| + output types. Non-applicable entries are filled with NaN. |
jorisvandenbossche
Owner
|
jorisvandenbossche
and 1 other
commented on an outdated diff
Oct 2, 2014
| + raise ValueError(msg) | ||
| + fself = self | ||
| + else: | ||
| + fself = self.select_dtypes(include=include, exclude=exclude) | ||
| + | ||
| + ldesc = [col.describe(percentile_width=percentile_width, | ||
| + percentiles=percentiles) for _, col in fself.iteritems()] | ||
| + # set a convenient order for rows | ||
| + names = [] | ||
| + ldesc_indexes = sorted([x.index for x in ldesc], key=len) | ||
| + for idxnames in ldesc_indexes: | ||
| + for name in idxnames: | ||
| + if name not in names: | ||
| + names.append(name) | ||
| + d = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1) | ||
| + return d |
jorisvandenbossche
Owner
|
jorisvandenbossche
and 1 other
commented on an outdated diff
Oct 2, 2014
| @@ -3751,42 +3792,19 @@ def describe_categorical_1d(data): | ||
| elif issubclass(data.dtype.type, np.datetime64): | ||
| asint = data.dropna().values.view('i8') | ||
| - names += ['first', 'last', 'top', 'freq'] | ||
| - result += [lib.Timestamp(asint.min()), | ||
| - lib.Timestamp(asint.max()), | ||
| - lib.Timestamp(top), freq] | ||
| + names += ['top', 'freq', 'first', 'last'] |
bthyreau
Contributor
|
|
ok, refactored a bit to avoid the recomputation of parameters due to recursion. Thanks for pointing it out. As a bonus, the code paths are shorter and easier to follow! |
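The shape of that refactor can be sketched as follows: validate and normalize the percentiles once up front, then hand the cleaned list straight to each per-column helper, so nothing is recomputed when describe recurses column by column. Helper names here are made up for illustration, not the actual pandas internals.

```python
import numpy as np

def _clean_percentiles(percentiles):
    # Normalized exactly once, at the top-level entry point.
    percentiles = sorted(set([0.25, 0.5, 0.75] if percentiles is None
                             else list(percentiles)))
    if not all(0 <= p <= 1 for p in percentiles):
        raise ValueError("percentiles should all be in the interval [0, 1]")
    return percentiles

def describe_1d(col, percentiles):
    # Receives an already-cleaned list; no re-validation needed here.
    return {'count': len(col),
            **{'%g%%' % (p * 100): float(np.percentile(col, p * 100))
               for p in percentiles}}

def describe_frame(columns, percentiles=None):
    percentiles = _clean_percentiles(percentiles)   # validated once
    return [describe_1d(col, percentiles) for col in columns]

out = describe_frame([[1, 2, 3, 4], [10, 20, 30, 40]], percentiles=[0.5])
```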
jreback
and 1 other
commented on an outdated diff
Oct 4, 2014
| @@ -3751,42 +3767,45 @@ def describe_categorical_1d(data): | ||
| elif issubclass(data.dtype.type, np.datetime64): | ||
| asint = data.dropna().values.view('i8') | ||
| - names += ['first', 'last', 'top', 'freq'] |
jreback
Contributor
|
jreback
commented on an outdated diff
Oct 5, 2014
| - return pd.Series(result, index=names) | ||
| - | ||
| - if is_object: | ||
| - if data.ndim == 1: | ||
| - return describe_categorical_1d(self) | ||
| + names += ['top', 'freq', 'first', 'last'] | ||
| + result += [lib.Timestamp(top), freq, | ||
| + lib.Timestamp(asint.min()), | ||
| + lib.Timestamp(asint.max())] | ||
| + | ||
| + return pd.Series(result, index=names, name=data.name) | ||
| + | ||
| + def describe_1d(data, percentiles): | ||
| + if data._is_numeric_mixed_type: | ||
| + return describe_numeric_1d(data, percentiles) | ||
| + elif issubclass(data.dtype.type, np.timedelta64): |
jreback
Contributor
|
jreback
commented on the diff
Oct 5, 2014
| @@ -3733,10 +3747,12 @@ def pretty_name(x): | ||
| return '%.1f%%' % x | ||
| def describe_numeric_1d(series, percentiles): | ||
| - return ([series.count(), series.mean(), series.std(), | ||
| - series.min()] + | ||
| - [series.quantile(x) for x in percentiles] + | ||
| - [series.max()]) | ||
| + stat_index = (['count', 'mean', 'std', 'min'] + | ||
| + [pretty_name(x) for x in percentiles] + ['max']) |
jreback
Contributor
|
|
@bthyreau small change.
Going to create an issue to fix this, but don't have time right now. The complication is that std is allowed, but not var (yet std CALLS var). So need to do this in a non-hacky way. |
|
see here: pydata#8471 lmk when you make that change and push. |
|
@bthyreau if u can address this soon would be gr8 |
|
I think #8476 will allow this to merge cleanly. so hold off |
|
@bthyreau ok I think if u rebase this should work |
|
ok great. Rebasing and pushing now |
|
merge via 6d3803d thanks! |
|
side issue: I think we may need a rounding option or something to make some of the default output nicer. This example is from your tests.
You can 'fix' this by rounding (and you can check Timedelta(...).resolution to make sure that you are not cutting things off), e.g.
so prob need to have a wrapper for various functions (e.g. mean/std) to do this (for numeric-like). @bthyreau If you think this is worthwhile, pls create a new issue. |
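The rounding idea can be sketched like this, assuming a pandas version where `Timedelta.round` is available; the cutoff frequency is a caller's choice, to be checked against the data's actual resolution:

```python
import pandas as pd

t = pd.Series(pd.to_timedelta(['1 days 00:00:01.123456',
                               '2 days 00:00:02.654321']))

# The mean carries sub-second noise: 1 days 12:00:01.888888500
mean = t.mean()

# Rounding to whole seconds makes the summary far more readable,
# at the cost of discarding sub-second detail.
rounded = mean.round('s')
```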
bthyreau commented Sep 3, 2014
this patch adds a return_type keyword argument to describe() to make it more
flexible to use on mixed-type dataframes. Users can now select among returning
numeric, categorical, or both, as well as 'auto' (previous behaviour, default)
and 'same', which keeps the columns identical (useful, e.g., with groupby()).
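For reference, the original return_type dispatch (superseded by include=/exclude= during review) can be sketched as below; `describe_flexible` is a hypothetical stand-in, not the actual implementation:

```python
import numpy as np
import pandas as pd

def describe_flexible(df, return_type='auto'):
    if return_type == 'numeric_only':
        return df.select_dtypes(include=[np.number]).describe()
    if return_type == 'categorical_only':
        return df.select_dtypes(include=['object']).describe()
    if return_type == 'same':
        # union of both summaries; non-applicable cells become NaN
        pieces = [df[c].describe() for c in df.columns]
        return pd.concat(pieces, axis=1)
    return df.describe()   # 'auto': the historical default behaviour

df = pd.DataFrame({'colA': ['foo', 'foo', 'bar'] * 10,
                   'colC': np.arange(30)})
out = describe_flexible(df, return_type='same')
```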