Include missing data count in pd.DataFrame.describe() #26102
I would like to receive some feedback on the added summary statistic to Please let me know what everyone thinks, and once that's out of the way, I will proceed to change the documentation in the function to reflect the new enhancement.
Missing is not a method that exists in our API so I don't think it should be added here. Size seems more appropriate to me
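For readers following the naming discussion, here is a minimal sketch of how `count`, `size`, and a missing-value count relate, using only existing pandas calls rather than the change proposed in this PR. The sample data and the row labels `size` and `missing` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [np.nan, np.nan, 5.0, 6.0]}
)

# count() skips missing values, len(df) counts every row, and
# isna().sum() gives the per-column difference between the two.
extra = pd.DataFrame(
    {
        "size": pd.Series(len(df), index=df.columns),
        "missing": df.isna().sum(),
    }
).T

# Append the two extra rows to the regular describe() output.
summary = pd.concat([df.describe(), extra])
print(summary)
```

Since `missing` equals `size` minus `count` for each column, reporting either one alongside `count` conveys the same information.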
Can you merge master and address comment?
Codecov Report
@@ Coverage Diff @@
## master #26102 +/- ##
===========================================
- Coverage 91.97% 40.73% -51.25%
===========================================
Files 175 175
Lines 52379 52412 +33
===========================================
- Hits 48178 21349 -26829
- Misses 4201 31063 +26862
Continue to review full report at Codecov.
@WillAyd Thanks for the feedback, let me know if you agree with the change now that I changed 'missing' to 'size'. If all good, I will go ahead and fix other tests that are failing due to the added functionality.
I am only +0 on adding this.
@@ -9804,9 +9804,10 @@ def describe(self, percentiles=None, include=None, exclude=None):

         def describe_numeric_1d(series):
             stat_index = (['count', 'mean', 'std', 'min'] +
-                          formatted_percentiles + ['max'])
+                          formatted_percentiles + ['max', 'size'])
             d = ([series.count(), series.mean(), series.std(), series.min()] +
should be the first arg
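As a rough illustration of what the hunk under review does, the sketch below rebuilds the numeric summary as a standalone function with `size` appended after `max`, matching the order in the diff. The function name `describe_numeric_1d_with_size` and its `percentiles` default are hypothetical; this is not pandas' internal implementation.

```python
import pandas as pd


def describe_numeric_1d_with_size(series, percentiles=(0.25, 0.5, 0.75)):
    """Hypothetical stand-in for the inner helper; not pandas' actual code."""
    # Labels such as "25%", "50%", "75%" for the requested percentiles.
    pct_labels = [f"{p:.0%}" for p in percentiles]
    stat_index = ["count", "mean", "std", "min"] + pct_labels + ["max", "size"]
    values = (
        [series.count(), series.mean(), series.std(), series.min()]
        + series.quantile(list(percentiles)).tolist()
        # series.size includes missing values; series.count() does not.
        + [series.max(), series.size]
    )
    return pd.Series(values, index=stat_index, name=series.name)


result = describe_numeric_1d_with_size(pd.Series([1.0, None, 3.0], name="x"))
print(result)
```

If the review comment is read as asking for `size` to lead the output, that would mean moving `'size'` and `series.size` to the front of their respective lists; the sketch keeps the order shown in the diff.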
@@ -420,7 +420,7 @@ Other
 ^^^^^

 - Removed unused C functions from vendored UltraJSON implementation (:issue:`26198`)

+- Added enhancement to :func:`pd.DataFrame.describe` to include size as one of the summary statistics (:issue:`21689`)
this needs a separate subsection, it's actually a fairly large API change.
can you update per comments
Sorry for the long delay, I was away on vacation. I will be making the updates within the next couple of days.
closing as stale, ping if you'd like to continue.
@jreback @WillAyd I am interested in having the number of rows in the dataframe as an output in describe(). I assume that is what is being called
@jbrockmendel @mroeschke I see you are active contributors right now. Do you have any objections to adding the number of rows in the dataframe as an output of describe()?
I have no opinion on describe. If mroeschke expresses an opinion I'll agree with that; otherwise I'll agree with jreback at +0.
git diff upstream/master -u -- "*.py" | flake8 --diff