Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/API: accept list-like percentiles in describe (WIP) #7088

Merged
merged 1 commit into from May 14, 2014

Conversation

Projects
None yet
3 participants
@TomAugspurger
Copy link
Contributor

commented May 9, 2014

Closes #4196

This is for frames. I'm going to refactor this into generic since to cover series / frames.

A couple questions:

  • API wise, I added a new kwarg percentiles. For backwards compat, we keep the percentile_width kwarg. I changed the default percentile_width from 50 to None (but the default output is the same) Cases:
    1. You specify percentile_width and percentiles -> ValueError
    2. You specify neither percentile_width nor percentiles -> percentile_width set to 50 and same as before
    3. You specify one of those, everything goes as expected.
  • I'm accepting either decimals (e.g. [0.25, .5, .75]) or percentages (e.g. [25, 50, 75]). Those two are equivalent in output. Should I move this logic to the .quantile to be more consistent?
  • I'm choosing to not sort the provided percentiles. It's easy for the user to sort, but hard for them to unsort if for some reason they want it in a specific order. I'd rather not add another kwarg.

@jreback jreback added this to the 0.14.1 milestone May 9, 2014

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2014

I'm close on this... a quick question though. There's also a describe for object types (strs or datetime). Notice the different index order for the last one (this is all current behavior):

# strs only
In [31]: df2 = pd.DataFrame({"C2": ['a', 'a', 'b', 'c']})

In [32]: df2.describe()
Out[32]: 
       C2
count   4
unique  3
top     a
freq    2

[4 rows x 1 columns]

# datetime only
In [28]: df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D')})

In [29]: df
Out[29]: 
          C1
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04

[4 rows x 1 columns]

In [30]: df.describe()
Out[30]: 
                         C1
count                     4
unique                    4
first   2010-01-01 00:00:00
last    2010-01-04 00:00:00
top     2010-01-01 00:00:00
freq                      1

[6 rows x 1 columns]

# mix of timestamp and strs
In [33]: df = pd.concat([df, df2], axis=1)

In [35]: df.describe()
Out[35]: 
                         C1   C2
count                     4    4
first   2010-01-01 00:00:00  NaN
freq                      1    2
last    2010-01-04 00:00:00  NaN
top     2010-01-01 00:00:00    a
unique                    4    3

[6 rows x 2 columns]

So the index gets sorted. Is it worth breaking backwards compat to keep the index in a sensible order? I'm not sure.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 10, 2014

yeh these should be in a sensible order I think
u can put it in API changes

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2014

Moved to generic (I'm not sure it was worth it; the code got pretty messy with a bunch of if / else.), updated docs. Should be good when travis says so.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 10, 2014

ok, in theory ndim>=3 should work via .apply FYI

you can put tests in test_generic.py (you can do specific tests or have it create them generically)

@jreback

View changes

pandas/core/generic.py Outdated
# dtypes: numeric only, numeric mixed, objects only
data = self._get_numeric_data()
if self.ndim > 1:
if len(data.columns) == 0:

This comment has been minimized.

Copy link
@jreback

jreback May 10, 2014

Contributor

do this as len(data._info_axis)

@jreback

View changes

pandas/core/generic.py Outdated
def describe_categorical_1d(data):
if data.dtype == object:
names = ['count', 'unique']
objcounts = compat.Counter(data.dropna().values)

This comment has been minimized.

Copy link
@jreback

jreback May 10, 2014

Contributor

isn't this value_counts?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger May 10, 2014

Author Contributor

Close. Changing it to use value_counts() breaks a test since I guess Counter keeps track of the
order for ties (same number of counts), but value_counts() doesn't.

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger May 11, 2014

Author Contributor

Looks like I was wrong about Counter tracking order, it's also arbitrary in the case of ties. I changed this to use value_counts and added a note about how the order is arbitrary in the case of ties.

@jreback

View changes

pandas/core/generic.py Outdated
result = [data.count(), len(objcounts)]
if result[1] > 0:
names += ['top', 'freq']
top, freq = objcounts.most_common(1)[0]

This comment has been minimized.

Copy link
@jreback

jreback May 10, 2014

Contributor

this is just value_counts().iloc[0], right?

@jreback

View changes

pandas/core/generic.py Outdated
most common. Timestamps also include the first and last items.
"""

@Appender(_shared_docs['describe'] % _shared_doc_kwargs)

This comment has been minimized.

Copy link
@jreback

jreback May 10, 2014

Contributor

maybe just have ndim>=3 raise for now (or do an apply)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger May 10, 2014

Author Contributor

I'm just going to raise for now. I'll think about what would be included in describe for a Panel and what the output would look like.

This comment has been minimized.

Copy link
@jreback

jreback May 10, 2014

Contributor

sounds good

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 11, 2014

@jreback was there anything else you saw here? I think it''s ready.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 11, 2014

didn't realize their is an argument percentile_width

I think u should just rename it to percentiles and make it what u have for percentiles (and if it's a scalar then the meaning is unchanged)

I think too confusing with that argument (which is prob not used much at all) - yours is much more useful

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 11, 2014

Should we do any warning / deprecation? I should be able to handle that very easily.

On May 11, 2014, at 3:46 PM, "jreback" <notifications@github.commailto:notifications@github.com> wrote:

didn't realize their is an argument percentile_width

I think u should just rename it to percentiles and make it what u have for percentiles (and if it's a scalar then the meaning is unchanged)

I think too confusing with that argument (which is prob not used much at all) - yours is much more useful


Reply to this email directly or view it on GitHubhttps://github.com//pull/7088#issuecomment-42782700.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 11, 2014

sure why don't I deprecate perentile_width and replace with percentile

otherwise functionality is the same

@jsexauer jsexauer referenced this pull request May 12, 2014

Open

DEPR: deprecations from prior versions #6581

0 of 95 tasks complete
@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2014

Added a note about this deprecation to ##6581.

Anything else?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 12, 2014

@jorisvandenbossche Does my deprecation note here look ok? That's how the numpy guide said to do it for objects. I assumed it was similar for keyword arguments.

@jorisvandenbossche

View changes

pandas/core/generic.py Outdated
@@ -3478,6 +3478,152 @@ def _convert_timedeltas(x):

return np.abs(self)

_shared_docs['describe'] = """
Generate various summary statistics of self, excluding

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche May 12, 2014

Member

I think the of self is not very clear for people not knowing the self-concept, maybe just leave it out?

@jorisvandenbossche

View changes

pandas/core/generic.py Outdated
raise NotImplementedError(msg)

if percentile_width is not None and percentiles is not None:
msg = "One of percentile_width and percentiles must be None"

This comment has been minimized.

Copy link
@jorisvandenbossche

jorisvandenbossche May 12, 2014

Member

Maybe better to say Cannot both specify 'percentile_width' and 'percentiles'?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented May 12, 2014

Some comments:

  • is it necessary that both the percentage-style (50) and number-style (0.5) are allowed? This seems to make it a bit complex to me (what if you use the first style, but want to give 0.5%, then you have to use the other style and it becomes confusing I think)
  • the deprecation of percentile_width is not in the release notes
  • About the deprecation note in the docstring. I think this is OK, but another option is to keep the parameter in the parameter section, but start its explanation with like 'Deprecated and will be removed. Use instead ... ' (the numpy docstring standard does not really mention this how to do, I think in pandas we mostly mention it in the parameter section itself instead of a seperate note (although we maybe mostly forget it ...), eg http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html?highlight=deprecated).
@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 13, 2014

@jorisvandenbossche thanks for the comments.

I was hoping that accepting both percentages and raw decimals would be less confusing, since I can never remember which we expect.

I actually had a longer reply written and then I realized why it was confusing. I'll switch it back to just expecting decimals between [0, 1], which is at least consistent with quantile.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented May 13, 2014

@TomAugspurger I think quantile was originally designed this way to match R, however, now it dismatches np.percentile which uses [0,100]. In any case, too late to change now in pandas I think, and indeed most important to be consistent within pandas with quantile.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented May 13, 2014

BTW, nice and informative FutureWarning! +1

@jreback jreback modified the milestones: 0.14.0, 0.14.1 May 13, 2014

TomAugspurger added a commit that referenced this pull request May 14, 2014

Merge pull request #7088 from TomAugspurger/describe-quantiles
ENH/API: accept list-like percentiles in describe (WIP)

@TomAugspurger TomAugspurger merged commit f26e668 into pandas-dev:master May 14, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented May 14, 2014

@TomAugspurger why are you using Counter rather than pd.core.algorithm.value_counts() again?

Counter DOES not sort, while value_counts does.

On windows test_generic/test_describe_object is failing on the datetimes because of the arbitrary order from Counter (as the counts are all 1 for some reason it picks the 2nd one).

======================================================================
FAIL: test_describe_objects (pandas.tests.test_generic.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\tests\test_generic.py", line 997, in test_describe_objects
    assert_frame_equal(result, expected)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 573, in assert_frame_equal
    check_exact=check_exact)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 520, in assert_series_equal
    assert_almost_equal(left.values, right.values, check_less_precise)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)
  File "testing.pyx", line 93, in pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2338)
AssertionError: Timestamp('2010-01-02 00:00:00') != Timestamp('2010-01-01 00:00:00')

----------------------------------------------------------------------
Ran 6971 tests in 194.439s

FAILED (SKIP=130, failures=1)
2.7\pandas\core\generic.py(3576)describe_categorical_1d()
-> if data.dtype == object:
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3585)describe_categorical_1d()
-> elif issubclass(data.dtype.type, np.datetime64):
(Pdb)
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3586)describe_categorical_1d()
-> names = ['count', 'unique']
(Pdb) p data
0   2010-01-01
1   2010-01-02
2   2010-01-03
3   2010-01-04
Name: C1, dtype: datetime64[ns]
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3587)describe_categorical_1d()
-> asint = data.dropna().values.view('i8')
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3588)describe_categorical_1d()
-> objcounts = compat.Counter(asint)
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) l
3583                        result += [top, freq]
3584
3585                elif issubclass(data.dtype.type, np.datetime64):
3586                    names = ['count', 'unique']
3587                    asint = data.dropna().values.view('i8')
3588 ->                 objcounts = compat.Counter(asint)
3589                    result = [data.count(), len(objcounts)]
3590                    if result[1] > 0:
3591                        top, freq = objcounts.most_common(1)[0]
3592                        names += ['first', 'last', 'top', 'freq']
3593                        result += [lib.Timestamp(asint.min()),
(Pdb) p compat.Counter(asint)
Counter({1262390400000000000: 1, 1262563200000000000: 1, 1262304000000000000: 1, 1262476800000000000: 1})
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) p pd.algorithms.value_counts
*** AttributeError: AttributeError("'module' object has no attribute 'algorithms'",)
(Pdb) p pd.core.algorithms.value_counts
<function value_counts at 0x00000000076FDBA8>
(Pdb) p pd.core.algorithms.value_counts(asint)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb) p pd.core.algorithms.value_counts(asint,sort=True)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb)
@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented May 14, 2014

Ahh I missed that one. I'll switch it over to use value counts and fix the test so that it isn't ambiguous.

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 14, 2014

awesome just put up a pr and I can test

@jreback jreback referenced this pull request Jul 24, 2016

Open

DEPR: deprecations log for removed issues #13777

120 of 120 tasks complete

@TomAugspurger TomAugspurger deleted the TomAugspurger:describe-quantiles branch Nov 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.