Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort doesn't work after agg without ungroup #41

Closed
samarpan-rai opened this issue Aug 18, 2020 · 8 comments
Closed

Sort doesn't work after agg without ungroup #41

samarpan-rai opened this issue Aug 18, 2020 · 8 comments

Comments

@samarpan-rai
Copy link
Contributor

I am not sure if this is a bug but it is definitely not a feature request so I am writing it as a bug report.

Problem
I was expecting the following code to return a sorted list of the count grouped by the primary_type key but that is not the case. I see an unsorted list. Is this an expected behaviour?

from clumper import Clumper
(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .sort(key=lambda x : x['occurence'])
    .collect()
)

Additional context

Adding ungroup after agg and before sort solves it. The following code produces what I want.

from clumper import Clumper

(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .mutate(primary_type = lambda c : c['type'][0])
    .group_by('primary_type')
    .agg(occurence = ('primary_type','count'))
    .ungroup()
    .sort(key=lambda x : x['occurence'])
    .collect()
)
@samukweku
Copy link
Contributor

samukweku commented Aug 19, 2020

Since it is a grouping I expect the sort to occur within each group. For the example above, there is only one data per group, so sorting is pretty much unnecessary, hence the unsorted look.

@samarpan-rai
Copy link
Contributor Author

@samukweku Maybe I didn't fully understand the difference. Can you give me an example of within group sorting?

@samukweku
Copy link
Contributor

Try grouping by the type, after explode, and sort on the total key :

(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .explode('type')
    .group_by('type')
    .sort(key=lambda x : x['total'])
    .collect()
)

the function you used is a reducing function - you get just one value per group... no point sorting within a group with a single value. Hope the code above explains better sorting within a group

@koaning
Copy link
Owner

koaning commented Aug 19, 2020

@samarpan-rai yeah I might argue this is expected behaviour. Just to check, have you seen this guide on the docs?

The .agg operation does not remove the grouping. It's a bit of extra work but I found that in the end, groups should be dealt with explicitly. Otherwise the user might need to do a lot of accounting in their mind. One way of looking at it, it's still better than having to call .reset_index().

@samukweku just mentioning it as a by the way, but on GitHub you can get pretty syntax highlighting for python if you add this at the start of a code snippet.

```python

@samukweku
Copy link
Contributor

samukweku commented Aug 19, 2020

  • @koaning thanks for that. And to think I use it all the time when writing documentation and yet fail to use it when making comments 🤦‍♂

@koaning
Copy link
Owner

koaning commented Aug 19, 2020

@samarpan-rai just to check, can I close this issue?

@samarpan-rai
Copy link
Contributor Author

Try grouping by the type, after explode, and sort on the total key :

(
    Clumper.read_jsonl("https://calmcode.io/datasets/pokemon.jsonl")
    .explode('type')
    .group_by('type')
    .sort(key=lambda x : x['total'])
    .collect()
)

the function you used is a reducing function - you get just one value per group... no point sorting within a group with a single value. Hope the code above explains better sorting within a group

Yes, thank you! It indeed makes sense for within group sorting.

@samarpan-rai yeah I might argue this is expected behaviour. Just to check, have you seen this guide on the docs?
The .agg operation does not remove the grouping. It's a bit of extra work but I found that in the end, groups should be dealt with explicitly. Otherwise the user might need to do a lot of accounting in their mind.

I did read it shortly but seem to have missed "With groups active, still perform the sorting but only within each group." I tried to transfer my experience from R's dplyr where one can do the following to get a sorted count by some grouping variable name.

df %>%
  group_by(name) %>%
  count(name) %>%
  arrange(n)

@samarpan-rai
Copy link
Contributor Author

@samarpan-rai just to check, can I close this issue?

Yes

@koaning koaning closed this as completed Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants