Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: show percentiles in timestamp describe (#30164) #30209

Merged
merged 3 commits into from
Jan 20, 2020

Conversation

david-cortes
Copy link
Contributor

@david-cortes david-cortes commented Dec 11, 2019

closes #30164

When creating a pandas Series of timestamps and calling describe on it, it will not show the percentiles of the data, even if these are specified as arguments (s.describe(percentiles = [0.25, 0.5, 0.75])).

This PR solves the issue by introducing a new describe logic for timestamps that would:

  • Add the percentiles in the same way as for numeric.
  • Rename first and last to min and max in order to match numeric types.
  • Add the mean of the series.
  • Show datetime columns alongside numerics by default in the describe method for DataFrames. (just realized there were other issues in which this was deemed to be the desirable behavior)

After this, it more or less matches the functionality of R's summary on POSIXct and Date types.

Why I think this is a good idea: I oftentimes find myself trying to inspect tables of mixed types, and want to get a quick overview of which kind of ranges and variations are there in each column, including timestamps. Currently this is very inconvenient in pandas, and they are not displayed alongside numeric columns by default and don't match the names if passing include = "all", which is what I want to see when I call DataFrame.describe.

Examples:

import numpy as np, pandas as pd
dt_list = [
    "2019-01-01 00:01:00",
    "2019-01-01 00:02:00",
    "2019-01-01 00:02:10",
    "2019-01-01 00:10:01",
    "2019-01-01 00:11:00"
]
s = pd.to_datetime(pd.Series(dt_list))
s.describe()
pd.DataFrame({
    "col1" : s,
    "col2" : np.arange(s.shape[0])
}).describe(include = "all")

Update: corrected the coding style to comply with black and pass the automatic test.

@gfyoung gfyoung added Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations Datetime Datetime data dtype labels Dec 12, 2019
@@ -9806,11 +9788,28 @@ def describe_categorical_1d(data):

return pd.Series(result, index=names, name=data.name, dtype=dtype)

def describe_timestamp_1d(data):
tz = data.dt.tz
Copy link
Member

@gfyoung gfyoung Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reference issue number above this line.

@pep8speaks
Copy link

pep8speaks commented Dec 12, 2019

Hello @david-cortes! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-20 21:54:50 UTC

@david-cortes
Copy link
Contributor Author

david-cortes commented Dec 12, 2019

Added the issue number in the code, and also modified it to handle cases with empty inputs. Modified the tests on describe in Series and DataFrame too.

EDIT: updated yet again for code style checks.

EDIT2: updated again for alphabetical order of imports.

@MarcoGorelli
Copy link
Member

Hi @david-cortes - seems like one of the Azure jobs is failing (but the build isn't there anymore so we can't check it), and that there are now some conflicts with master. Is this something you're still working on?

@david-cortes david-cortes reopened this Jan 18, 2020
@david-cortes
Copy link
Contributor Author

@MarcoGorelli : No, not working on anything else here, the PR is finished. Solved the merge conflicts now (they were caused by some tests moving to separate files).

From what I get of the automatic checks that were failing, they were caused by something unrelated to this PR.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@david-cortes can you create a new issue to move all of the describe helpers to pandas/io/alogorithms.py (or really we should just create pandas/io/algos/describe.py) and then move everything else.

can do before or after this PR

def describe_timestamp_1d(data):
# GH-30164
tz = data.dt.tz
asint = data.dropna().values.view("i8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the .values.view, just use len(data.dropna()) for the shape check

data.min() and so on all work and return NaT correctly, none of these gymnastics are needed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that's what the current code does, but let's simplify

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also needs a whats note for 1.1, put in enhancements

david-cortes added a commit to david-cortes/pandas that referenced this pull request Jan 20, 2020
@david-cortes
Copy link
Contributor Author

david-cortes commented Jan 20, 2020

Yes, after the latest changes the internal datetime methods now work just fine with empty inputs and missing values. Changed the code to use those instead, and added the changes to the 1.1.0 what’s new entries. Will now open another one for moving the describe methods to a new file.

EDIT: corrected for linting now.

@david-cortes
Copy link
Contributor Author

@jreback : Did you mean to say to move them to pandas/core/algorithms.py? Would seem out of place in io.

@jreback
Copy link
Contributor

jreback commented Jan 20, 2020

@jreback : Did you mean to say to move them to pandas/core/algorithms.py? Would seem out of place in io.

yes though i’d like to create

pandas/core/algos/describe.py

and split up algorithms (so this can. e a step there)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small comment ping on green.

@@ -18,6 +18,8 @@ Enhancements
Other enhancements
^^^^^^^^^^^^^^^^^^

- :meth:`Series.describe` will now show distribution percentiles for ``datetime`` dtypes, statistics ``first`` and ``last``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to api changes section.

3,
4,
],
"s1": [5, 2, 0, 1, 2, 3, 4, 1.581139],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have sufficient tests for timedeltas?

can you create a test for Period (which likely don't work), if you'd just xfail it (alt if you'd create an issue)

@jreback jreback added this to the 1.1 milestone Jan 20, 2020
@david-cortes
Copy link
Contributor Author

Moved the changes in the what’s new and added tests for Period and Timedelta. The describe for Timedelta is already handled as a numeric dtype, and the describe for Period as object dtype, so those are not affected by this PR.

@@ -29,6 +29,36 @@ def test_describe(self):
)
tm.assert_series_equal(result, expected)

s = Series(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for now, in a followon we want to parameterize these.

@jreback jreback merged commit f1bbb21 into pandas-dev:master Jan 20, 2020
@jreback
Copy link
Contributor

jreback commented Jan 20, 2020

thanks @david-cortes

if you woudn't mind parameterize / separating out the other dtypes tests would be great (could be combined with moving the code out from pandas/core/algorithms.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request: show datetime percentiles in 'describe' method
5 participants