Include missing data count in pd.DataFrame.describe() #26102

Closed

Contributor

alexander-ponomaroff commented Apr 15, 2019

  • closes #21689
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Contributor Author

alexander-ponomaroff commented Apr 15, 2019

I would like some feedback on the summary statistic added to describe(). The choice is between missing, which counts the number of missing values, and length, which counts all values including missing ones, so a user who wants the missing count has to subtract count from length. I bring this up because @jreback asked for length in the issue, while the creator of the issue asked for missing. In my opinion, the way I currently did it, with missing, is better.

Please let me know what everyone thinks, and once that's out of the way, I will proceed to change the documentation in the function to reflect the new enhancement.
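
To make the two options concrete, here is a minimal sketch that uses only existing pandas calls. The sample Series is made up for illustration, and the variable names are not taken from the PR.

```python
import numpy as np
import pandas as pd

# Made-up sample data with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

count = s.count()         # non-missing values only
length = len(s)           # all values, missing included (same as s.size)
missing = s.isna().sum()  # missing values only

# Once count is known, either statistic can be derived from the other.
assert missing == length - count
```

Either way the information is the same; the question is only which of the two rows describe() should report next to count.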

Member

WillAyd commented Apr 15, 2019

Missing is not a method that exists in our API, so I don't think it should be added here. Size seems more appropriate to me.

Member

WillAyd left a comment

Can you merge master and address comment?


codecov bot commented Apr 27, 2019

Codecov Report

Merging #26102 into master will decrease coverage by 51.24%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           master   #26102       +/-   ##
===========================================
- Coverage   91.97%   40.73%   -51.25%     
===========================================
  Files         175      175               
  Lines       52379    52412       +33     
===========================================
- Hits        48178    21349    -26829     
- Misses       4201    31063    +26862
Flag         Coverage Δ
#multiple    ?
#single      40.73% <ø> (-0.12%) ⬇️

Impacted Files                        Coverage Δ
pandas/core/generic.py                37.78% <ø> (-55.76%) ⬇️
pandas/io/formats/latex.py            0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py        0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py    0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py            0% <0%> (-100%) ⬇️
pandas/tseries/converter.py           0% <0%> (-100%) ⬇️
pandas/io/formats/html.py             0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py             0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py            0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py          10.44% <0%> (-89.56%) ⬇️
... and 132 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 64104ec...5398274.

Contributor Author

alexander-ponomaroff commented Apr 27, 2019

@WillAyd Thanks for the feedback. Let me know if you agree with the change now that I have changed 'missing' to 'size'. If it all looks good, I will go ahead and fix the other tests that are failing due to the added functionality.

Contributor

jreback left a comment

I am only +0 on adding this.

@@ -9804,9 +9804,10 @@ def describe(self, percentiles=None, include=None, exclude=None):

         def describe_numeric_1d(series):
             stat_index = (['count', 'mean', 'std', 'min'] +
-                          formatted_percentiles + ['max'])
+                          formatted_percentiles + ['max', 'size'])
             d = ([series.count(), series.mean(), series.std(), series.min()] +


jreback Apr 28, 2019

Contributor

should be the first arg
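
One reading of this request is that 'size' should lead the list of statistics rather than trail 'max'. The sketch below illustrates that ordering under stated assumptions: describe_numeric_with_size is a hypothetical standalone helper, not the pandas internal describe_numeric_1d, and the percentile labels are formatted by hand rather than by the pandas formatting helper.

```python
import numpy as np
import pandas as pd


def describe_numeric_with_size(series, percentiles=(0.25, 0.5, 0.75)):
    """Hypothetical helper: numeric summary with 'size' as the first row."""
    percentiles = list(percentiles)
    # Hand-rolled labels such as '25%'; an assumption, not the pandas formatter.
    labels = ["{:g}%".format(p * 100) for p in percentiles]
    stat_index = ["size", "count", "mean", "std", "min"] + labels + ["max"]
    values = (
        [series.size, series.count(), series.mean(), series.std(), series.min()]
        + series.quantile(percentiles).tolist()
        + [series.max()]
    )
    return pd.Series(values, index=stat_index, name=series.name)


# Made-up data: three values, one of them missing.
example = describe_numeric_with_size(pd.Series([1.0, np.nan, 3.0]))
```

With this ordering, 'size' and 'count' sit next to each other, so the number of missing values can be read off as their difference.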

@@ -420,7 +420,7 @@ Other
^^^^^

- Removed unused C functions from vendored UltraJSON implementation (:issue:`26198`)

- Added enhancement to :func:`pd.DataFrame.describe` to include size as one of the summary statistics (:issue:`21689`)


jreback Apr 28, 2019

Contributor

this needs a separate subsection; it's actually a fairly large API change.
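
A short sketch of why an extra row can count as an API change: code that relies on the shape or row order of the describe() result would see different output. The DataFrame is made up, and with_size is a hypothetical result built by hand, not the output of any released pandas version.

```python
import pandas as pd

# Made-up data: column 'a' has one missing value.
df = pd.DataFrame({"a": [1.0, 2.0, None], "b": [4.0, 5.0, 6.0]})
summary = df.describe()

# Row labels for numeric columns with default arguments.
row_labels = list(summary.index)

# Positional access: the first row is 'count' today.
first_row = summary.iloc[0]

# Hypothetical output with a 'size' row prepended, built by hand.
size_row = pd.DataFrame(
    {col: [df[col].size] for col in df.columns}, index=["size"]
)
with_size = pd.concat([size_row, summary])
```

Any caller that indexes the summary by position, or that asserts on its shape, would be affected by the extra row, which is presumably why a dedicated whatsnew subsection was requested.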

Contributor

jreback commented May 12, 2019

can you update per the comments?

Contributor Author

alexander-ponomaroff commented May 19, 2019

Sorry for the long delay; I was away on vacation. I will be making the updates within the next couple of days.

Contributor

jreback commented Jun 27, 2019

closing as stale, ping if you'd like to continue.

jreback closed this Jun 27, 2019