arrange() not working #56
Comments
Try adding I regularly use In addition, please ensure that the dataframe is sorted by the same key as in the "group_by" before using "group_by". This is a key difference between SQL and Pandas - grouping requires pre-sorting first or it may not work. |
@sharpe5's answer is correct, but this behavior unexpectedly diverges from dplyr's. As far as I can tell, after the Is this the intended behavior? If so, should the difference be documented? Thanks!
In R with dplyr:
r$> library(tidyverse)
r$> diamonds %>%
group_by(cut) %>%
summarize(count_color=n()) %>%
arrange(-count_color)
# A tibble: 5 x 2
cut count_color
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
r$> diamonds %>%
group_by(cut) %>%
summarize(count_color=n()) %>%
ungroup() %>%
arrange(-count_color)
# A tibble: 5 x 2
cut count_color
<ord> <int>
1 Ideal 21551
2 Premium 13791
3 Very Good 12082
4 Good 4906
5 Fair 1610
In Python with dfply:
In [8]: (dfply.diamonds >>
...: group_by('cut') >>
...: summarize(count_color=dfply.n(X.color)) >>
...: arrange(X.count_color, ascending=False))
Out[8]:
cut count_color
0 Fair 1610
1 Good 4906
2 Ideal 21551
3 Premium 13791
4 Very Good 12082
In [7]: (dfply.diamonds >>
...: group_by('cut') >>
...: summarize(count_color=dfply.n(X.color)) >>
...: ungroup() >>
...: arrange(X.count_color, ascending=False))
Out[7]:
cut count_color
2 Ideal 21551
3 Premium 13791
4 Very Good 12082
1 Good 4906
0 Fair 1610
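Yes, this is the intended behavior. I am aware that it diverges from dplyr. I understand the rationale in dplyr that summarize would eliminate the groupings, since it is "collapsing" the groups to single rows and therefore the groupings become "meaningless".
I am not opposed to changing it to match the dplyr behavior if that is what the people want. My rationale for not doing this despite the collapsing into rows is that the ungrouping becomes implicit rather than explicit. My reasoning was that the grouping should be preserved until you explicitly state that groupings should be collapsed into a single dataframe.
Not sure which direction to go. If you think that the dplyr way is superior I am happy to change it to that.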
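For anyone landing here later, here is a minimal sketch of why the two dfply outputs above differ. It is plain Python that models the behavior described in this thread, not dfply's actual implementation. The helper names `arrange_within_groups` and `arrange_ungrouped` are hypothetical, and the counts are copied from the outputs above.

```python
from itertools import groupby
from operator import itemgetter

# Summary rows as they look after group_by('cut') >> summarize(...);
# the counts are copied from the outputs earlier in this thread.
summary = [
    {"cut": "Fair", "count_color": 1610},
    {"cut": "Good", "count_color": 4906},
    {"cut": "Ideal", "count_color": 21551},
    {"cut": "Premium", "count_color": 13791},
    {"cut": "Very Good", "count_color": 12082},
]


def arrange_within_groups(rows, group_key, sort_key, ascending=True):
    """Hypothetical model of a grouped arrange: sort inside each group,
    then concatenate the groups in group-key order."""
    out = []
    ordered = sorted(rows, key=itemgetter(group_key))
    for _, group in groupby(ordered, key=itemgetter(group_key)):
        out.extend(sorted(group, key=itemgetter(sort_key), reverse=not ascending))
    return out


def arrange_ungrouped(rows, sort_key, ascending=True):
    """Hypothetical model of arrange after ungroup(): one global sort."""
    return sorted(rows, key=itemgetter(sort_key), reverse=not ascending)


grouped = arrange_within_groups(summary, "cut", "count_color", ascending=False)
ungrouped = arrange_ungrouped(summary, "count_color", ascending=False)

print([row["cut"] for row in grouped])
print([row["cut"] for row in ungrouped])
```

Under this model, each group holds a single row after the summary, so sorting inside a group changes nothing and the rows come back in group-key order (Fair, Good, Ideal, Premium, Very Good). The global sort gives Ideal, Premium, Very Good, Good, Fair, which is the ordering seen once `ungroup()` is added.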
I can confirm that I got stuck, and it was only after a lot of false starts
and experimentation that I worked out that ungroup() was required.
My vote is to have dfply behave like dplyr, so code can be directly
ported from R to Python and back again without the need to add extra
clauses.
However - is there any technical reason why this change might be a
disadvantage? Would this prevent some problems from being solved?
…On Fri, 18 Jan 2019 at 18:49, Kiefer Katovich ***@***.***> wrote:
Yes this is the intended behavior. I am aware that it diverges from dplyr.
I understand the rationale in dplyr that summarize would eliminate the
groupings, since it is "collapsing" the groups to single rows and therefore
the groupings become "meaningless".
I am not opposed to changing it to match the dplyr behavior if that is
what the people want. My rationale for not doing this despite the
collapsing into rows is that the ungrouping becomes implicit rather than
explicit. My reasoning was that the grouping should be preserved until you
explicitly state that groupings should be collapsed into a single dataframe.
Not sure which direction to go. If you think that the dplyr way is
superior I am happy to change it to that.
I'd prefer that |
I'm not wedded to any particular names, but I do think it's important that different functions have different names. How would you feel about:
This way there's no confusion and I can just pick the one I want without fear, and I don't have to worry about which version of dfply I'm on. |
I am trying to calculate summary statistics by a grouping variable and then sort the result in descending order.
The gold medal count (i.e. variable N) is not sorted in descending order.