Include missing data count in pd.DataFrame.describe() #26102

alexander-ponomaroff · 2019-04-15T16:31:16Z

closes Include missing data count in pd.Dataframe.describe method #21689
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

alexander-ponomaroff · 2019-04-15T16:37:36Z

I would like some feedback on the summary statistic added to describe(). The choice is between missing, which counts the number of missing values, and length, which counts all values including missing ones, so a user who wants the missing count would have to subtract count from length. I bring this up because @jreback asked for length in the issue, but the creator of the issue asked for missing. In my opinion, the current implementation with missing is better.

Please let me know what everyone thinks, and once that's out of the way, I will proceed to change the documentation in the function to reflect the new enhancement.
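
For reference, here is a minimal sketch of how the three quantities under discussion relate. The sample series is made up for illustration, and the variable names are not part of any proposed API.

```python
import numpy as np
import pandas as pd

# Hypothetical sample column with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

count = s.count()         # non-missing values only
length = len(s)           # all values, missing included
missing = s.isna().sum()  # missing values only

# Either candidate statistic can be derived from the other plus count.
assert missing == length - count
print(count, length, missing)
```

Whichever statistic is reported, the other is one subtraction away, so the question is mainly which one reads more naturally in the describe() output.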

WillAyd · 2019-04-15T18:47:51Z

Missing is not a method that exists in our API so I don't think it should be added here. Size seems more appropriate to me
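
A small sketch of the existing `size` attribute, assuming a made-up two-column frame. Note that `Series.size` counts every element in the column, while `DataFrame.size` is rows times columns, so a per-column describe() would presumably rely on the Series meaning.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only to illustrate the existing attributes.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

series_size = df["a"].size      # elements in the column, NaN included
series_count = df["a"].count()  # non-missing elements only
frame_size = df.size            # rows * columns for the whole DataFrame

print(series_size, series_count, frame_size)
```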

WillAyd

Can you merge master and address comment?

codecov · 2019-04-27T17:14:52Z

Codecov Report

Merging #26102 into master will decrease coverage by 51.24%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #26102       +/-   ##
===========================================
- Coverage   91.97%   40.73%   -51.25%     
===========================================
  Files         175      175               
  Lines       52379    52412       +33     
===========================================
- Hits        48178    21349    -26829     
- Misses       4201    31063    +26862

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `?` | |
| #single | `40.73% <ø> (-0.12%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/generic.py | `37.78% <ø> (-55.76%)` | ⬇️ |
| pandas/io/formats/latex.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/io/sas/sas_constants.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/core/groupby/categorical.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/tseries/plotting.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/tseries/converter.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/io/formats/html.py | `0% <0%> (-99.37%)` | ⬇️ |
| pandas/io/sas/sas7bdat.py | `0% <0%> (-91.16%)` | ⬇️ |
| pandas/io/sas/sas_xport.py | `0% <0%> (-90.1%)` | ⬇️ |
| pandas/core/tools/numeric.py | `10.44% <0%> (-89.56%)` | ⬇️ |

... and 132 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64104ec...5398274.

alexander-ponomaroff · 2019-04-27T17:21:37Z

@WillAyd Thanks for the feedback, let me know if you agree with the change now that I changed 'missing' to 'size'. If all good, I will go ahead and fix other tests that are failing due to the added functionality.

jreback

I am only +0 on adding this.

jreback · 2019-04-28T16:41:44Z

pandas/core/generic.py

@@ -9804,9 +9804,10 @@ def describe(self, percentiles=None, include=None, exclude=None):

        def describe_numeric_1d(series):
            stat_index = (['count', 'mean', 'std', 'min'] +
-                          formatted_percentiles + ['max'])
+                          formatted_percentiles + ['max', 'size'])
            d = ([series.count(), series.mean(), series.std(), series.min()] +


should be the first arg
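
A rough sketch of what that ordering could look like, reading the note above as asking for size to lead the list of statistics. The function name `describe_numeric_with_size` is hypothetical, and percentile labels are formatted by hand here rather than with the pandas internals that describe() uses.

```python
import numpy as np
import pandas as pd


def describe_numeric_with_size(series, percentiles=(0.25, 0.5, 0.75)):
    """Hypothetical stand-in for the inner describe_numeric_1d helper."""
    # Hand-rolled percentile labels such as "25%", "50%", "75%".
    labels = ["{:g}%".format(p * 100) for p in percentiles]
    # size comes first, ahead of count and the other statistics.
    stat_index = ["size", "count", "mean", "std", "min"] + labels + ["max"]
    values = (
        [series.size, series.count(), series.mean(), series.std(), series.min()]
        + series.quantile(list(percentiles)).tolist()
        + [series.max()]
    )
    return pd.Series(values, index=stat_index, name=series.name)


# Made-up input with one missing value.
result = describe_numeric_with_size(pd.Series([1.0, np.nan, 3.0], name="x"))
print(result)
```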

jreback · 2019-04-28T16:43:03Z

doc/source/whatsnew/v0.25.0.rst

@@ -420,7 +420,7 @@ Other
 ^^^^^

 - Removed unused C functions from vendored UltraJSON implementation (:issue:`26198`)
-
+- Added enhancement to :func:`pd.DataFrame.describe` to include size as one of the summary statistics (:issue:`21689`)


this needs a separate subsection, it's actually a fairly large API change.

jreback · 2019-05-12T21:13:50Z

can you update per comments

alexander-ponomaroff · 2019-05-19T22:10:46Z

Sorry for the long delay, I was away on vacation. I will be making the updates within the next couple of days.

jreback · 2019-06-27T03:35:13Z

closing as stale, ping if you'd like to continue.

drkarthi · 2022-10-19T22:33:54Z

@jreback @WillAyd I am interested in having the number of rows in the dataframe as an output in describe(). I assume that is what is being called size here. Is there still interest to have this added? I am happy to pick up from where @alexander-ponomaroff left off
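
Until something like this lands, one possible workaround is to append the row count to the result of describe(). This is only a sketch with a made-up frame, and the row label "size" is an assumption borrowed from this PR, not an existing describe() output.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

summary = df.describe()
# Add the total row count (missing values included) as an extra row.
summary.loc["size"] = len(df)

# Per-column missing counts then follow from size minus count.
missing = summary.loc["size"] - summary.loc["count"]
print(summary)
print(missing)
```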

drkarthi · 2022-10-26T19:02:59Z

@jbrockmendel @mroeschke I see you are active contributors right now. Do you have any objections to adding the number of rows in the dataframe as an output of describe()?

jbrockmendel · 2022-10-29T17:55:57Z

I have no opinion on describe. If mroeschke expresses an opinion I'll agree with that; otherwise I'll agree with jreback at +0.

Include missing data count in pd.DataFrame.describe

1627cd3

WillAyd added the API Design and DataFrame (DataFrame data structure) labels and removed the API Design label Apr 15, 2019

WillAyd requested changes Apr 26, 2019

alexander-ponomaroff added 2 commits April 27, 2019 13:13

Changed missing to size

f40bad3

Merge

cc13046

Documentation fix

5398274

jreback requested changes Apr 28, 2019

jreback closed this Jun 27, 2019
