arrange() not working #56

deepanshu88 · 2018-05-26T11:15:04Z

I am trying to calculate summary statistics by a grouping variable and then sort the result in descending order.

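# Import data
import pandas as pd
from dfply import *  # provides X, mask, group_by, summarize, n, arrange

mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

# 2006 gold medal count per NOC, intended to be sorted in descending order
(mydata >>
    mask(X.Year == 2006, X.Medal == 'Gold') >>
    group_by(X.NOC) >>
    summarize(N=n(X.NOC)) >>
    arrange(X.N, ascending=False))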

The gold medal count (i.e. variable N) is not sorted in descending order.

sharpe5 · 2018-05-26T13:15:26Z

Try adding >> ungroup() before the arrange.

I regularly use group_by(), ungroup() followed by arrange(), and it works perfectly every time.

In addition, please ensure that the dataframe is sorted by the same key as in the "group_by" before using "group_by". This is a key difference between SQL and Pandas - grouping requires pre-sorting first or it may not work.

ghost · 2018-11-13T09:15:15Z

@sharpe5's answer is correct, but this behavior unexpectedly diverges from dplyr's.

As far as I can tell, after the summarize, in dplyr the groups have been eliminated, while in dfply they remain attached to the dataframe.

Is this the intended behavior? If so, should the difference be documented?

Thanks!

In R with dplyr:

r$> library(tidyverse)   

r$> diamonds %>%  
        group_by(cut) %>%  
        summarize(count_color=n()) %>% 
        arrange(-count_color)                                                                                                                                                                                        
# A tibble: 5 x 2
  cut       count_color
  <ord>           <int>
1 Ideal           21551
2 Premium         13791
3 Very Good       12082
4 Good             4906
5 Fair             1610


r$> diamonds %>%  
        group_by(cut) %>%  
        summarize(count_color=n()) %>% 
        ungroup() %>% 
        arrange(-count_color)                                                                                                                                                                                        
# A tibble: 5 x 2
  cut       count_color
  <ord>           <int>
1 Ideal           21551
2 Premium         13791
3 Very Good       12082
4 Good             4906
5 Fair             1610

In Python with dfply:

In [8]: (dfply.diamonds >>  
   ...:     group_by('cut') >>  
   ...:     summarize(count_color=dfply.n(X.color)) >>  
   ...:     arrange(X.count_color, ascending=False))                                                                                                                                                                 
Out[8]: 
         cut  count_color
0       Fair         1610
1       Good         4906
2      Ideal        21551
3    Premium        13791
4  Very Good        12082


In [7]: (dfply.diamonds >>  
   ...:     group_by('cut') >>  
   ...:     summarize(count_color=dfply.n(X.color)) >>  
   ...:     ungroup() >>  
   ...:     arrange(X.count_color, ascending=False))                                                                                                                                                                 
Out[7]: 
         cut  count_color
2      Ideal        21551
3    Premium        13791
4  Very Good        12082
1       Good         4906
0       Fair         1610
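The two dfply outputs above can be reproduced with a toy model in plain Python. This is only a sketch of the presumed mechanism, not dfply's actual implementation: it assumes that a grouped arrange sorts rows inside each group and then stacks the groups in key order. The function name `toy_arrange` and its `grouped_by` parameter are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Summarized counts copied from the diamonds output above: one row per cut.
rows = [
    {"cut": "Fair", "count_color": 1610},
    {"cut": "Good", "count_color": 4906},
    {"cut": "Ideal", "count_color": 21551},
    {"cut": "Premium", "count_color": 13791},
    {"cut": "Very Good", "count_color": 12082},
]


def toy_arrange(rows, key, ascending=True, grouped_by=None):
    """Hypothetical stand-in for arrange(): sort all rows, or sort within
    each group when grouped_by is given (assumed grouped behavior)."""
    if grouped_by is None:
        return sorted(rows, key=itemgetter(key), reverse=not ascending)
    result = []
    # itertools.groupby needs its input ordered by the grouping key.
    by_group = sorted(rows, key=itemgetter(grouped_by))
    for _, members in groupby(by_group, key=itemgetter(grouped_by)):
        # Each group holds a single summary row, so this sort changes nothing.
        result.extend(sorted(members, key=itemgetter(key), reverse=not ascending))
    return result


still_grouped = toy_arrange(rows, "count_color", ascending=False, grouped_by="cut")
ungrouped = toy_arrange(rows, "count_color", ascending=False)

print([r["cut"] for r in still_grouped])
print([r["cut"] for r in ungrouped])
```

Under that assumption, the grouped call returns the rows in group-key order because every group contains exactly one row after the summarize, while the ungrouped call sorts the whole table by count.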

kieferk · 2019-01-18T18:48:48Z

Yes, this is the intended behavior, and I am aware that it diverges from dplyr. I understand the rationale in dplyr that summarize would eliminate the groupings, since it is "collapsing" the groups to single rows and therefore the groupings become "meaningless".

I am not opposed to changing it to match the dplyr behavior if that is what the people want. My rationale for not doing this despite the collapsing into rows is that the ungrouping becomes implicit rather than explicit. My reasoning was that the grouping should be preserved until you explicitly state that groupings should be collapsed into a single dataframe.

Not sure which direction to go. If you think that the dplyr way is superior I am happy to change it to that.

sharpe5 · 2019-01-18T20:17:34Z

I can confirm that I got stuck, and it was only after a lot of false starts and experimentation that I worked out that ungroup() was required. My vote is to have dfply behave like dplyr, so that code can be ported directly from R to Python and back again without adding extra clauses. However, is there any technical reason why this change might be a disadvantage? Would it prevent some problems from being solved?

ghost · 2019-01-18T20:31:46Z

I'd prefer that summarize() retain the ungrouping behavior of dplyr, and that if a group-retaining version is desired, that it have a different name. Since I regularly use both dplyr and dfply, it makes me nervous to have the same function name with substantially different behavior. I am worried about confusing them and getting wrong results without noticing.

jtrakk · 2020-01-21T19:44:28Z

I'm not wedded to any particular names, but I do think it's important that different functions have different names.

How would you feel about:

1. adding a function with a new name, digest(), that does ungrouping aggregation,
2. adding a function with a new name, review(), that does aggregation retaining groups, and
3. removing the summarize() function to eliminate that source of confusion with dplyr.

This way there's no confusion and I can just pick the one I want without fear, and I don't have to worry about which version of dfply I'm on.
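To make the proposal concrete, here is a rough plain-Python sketch of how two differently named aggregations could differ in what they return. Nothing here exists in dfply: digest() and review() are just the names suggested above, the helper `_count_by` and the sample rows are made up, and a real implementation would operate on DataFrames rather than lists of dicts.

```python
from collections import defaultdict


def _count_by(rows, key):
    """Count rows per distinct value of `key`, in first-seen order."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return counts


def digest(rows, key):
    """Hypothetical ungrouping aggregation: a flat list of summary rows."""
    return [{key: k, "n": n} for k, n in _count_by(rows, key).items()]


def review(rows, key):
    """Hypothetical group-retaining aggregation: summary rows keyed by group."""
    return {k: [{key: k, "n": n}] for k, n in _count_by(rows, key).items()}


# Made-up sample rows, not taken from the medals data set.
medals = [{"NOC": "GER"}, {"NOC": "USA"}, {"NOC": "GER"}]

flat = digest(medals, "NOC")
grouped = review(medals, "NOC")

# A flat result can be sorted across all groups right away.
ranked = sorted(flat, key=lambda r: r["n"], reverse=True)
print(ranked)
print(grouped)
```

In this sketch the two functions return different shapes, so a sort over the result of review() cannot silently be mistaken for a sort over the whole table.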

danielsjf mentioned this issue Sep 7, 2020

Issue arranging data after summarising with a new variable #93

Open
