Skip to content

Conversation

alexander-ponomaroff
Copy link
Contributor

@alexander-ponomaroff
Copy link
Contributor Author

I would like to receive some feedback on the added summary statistic to describe(). It's between having missing, which counts the number of missing values, and having length, which will include missing values and if the user wants to get the missing summary, they will have to subtract count from length. I bring this up because @jreback asked for length in the issue, but the creator of the issue asked for missing. In my opinion, I think that the way I currently did it with missing is better.

Please let me know what everyone thinks, and once that's out of the way, I will proceed to change the documentation in the function to reflect the new enhancement.

@WillAyd
Copy link
Member

WillAyd commented Apr 15, 2019

Missing is not a method that exists in our API so I don't think it should be added here. Size seems more appropriate to me

@WillAyd WillAyd added API Design DataFrame DataFrame data structure and removed API Design labels Apr 15, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master and address comment?

@codecov
Copy link

codecov bot commented Apr 27, 2019

Codecov Report

Merging #26102 into master will decrease coverage by 51.24%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26102       +/-   ##
===========================================
- Coverage   91.97%   40.73%   -51.25%     
===========================================
  Files         175      175               
  Lines       52379    52412       +33     
===========================================
- Hits        48178    21349    -26829     
- Misses       4201    31063    +26862
Flag Coverage Δ
#multiple ?
#single 40.73% <ø> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 37.78% <ø> (-55.76%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 132 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64104ec...5398274. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 27, 2019

Codecov Report

Merging #26102 into master will decrease coverage by 51.24%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26102       +/-   ##
===========================================
- Coverage   91.97%   40.73%   -51.25%     
===========================================
  Files         175      175               
  Lines       52379    52412       +33     
===========================================
- Hits        48178    21349    -26829     
- Misses       4201    31063    +26862
Flag Coverage Δ
#multiple ?
#single 40.73% <ø> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 37.78% <ø> (-55.76%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
... and 132 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64104ec...cc13046. Read the comment docs.

@alexander-ponomaroff
Copy link
Contributor Author

@WillAyd Thanks for the feedback, let me know if you agree with the change now that I changed 'missing' to 'size'. If all good, I will go ahead and fix other tests that are failing due to the added functionality.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only +0 on adding this.

@@ -9804,9 +9804,10 @@ def describe(self, percentiles=None, include=None, exclude=None):

def describe_numeric_1d(series):
stat_index = (['count', 'mean', 'std', 'min'] +
formatted_percentiles + ['max'])
formatted_percentiles + ['max', 'size'])
d = ([series.count(), series.mean(), series.std(), series.min()] +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be the first arg

@@ -420,7 +420,7 @@ Other
^^^^^

- Removed unused C functions from vendored UltraJSON implementation (:issue:`26198`)

- Added enhancement to :func:`pd.DataFrame.describe` to include size as one of the summary statistics (:issue:`21689`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a separate subsection, its actually a fairly large API change.

@jreback
Copy link
Contributor

jreback commented May 12, 2019

can you update per comments

@alexander-ponomaroff
Copy link
Contributor Author

Sorry for the long delay, I was away on vacation. I will be making the updates within the next couple of days.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2019

closing as stale, ping if you'd like to continue.

@jreback jreback closed this Jun 27, 2019
@drkarthi
Copy link

@jreback @WillAyd I am interested in having the number of rows in the dataframe as an output in describe(). I assume that is what is being called size here. Is there still interest to have this added? I am happy to pick up from where @alexander-ponomaroff left off

@drkarthi
Copy link

@jbrockmendel @mroeschke I see you are active contributors right now. do you you have any objections to adding the number of rows in the dataframe as an output of describe()

@jbrockmendel
Copy link
Member

i have no opinion on describe. if mroeschke expresses an opinion ill agree with that, otherwise ill agree with jreback at +0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include missing data count in pd.Dataframe.describe method
5 participants